Where necessary the underlying ideas are explained and the algorithms are given formally. It is assumed
that the reader is able to understand the given source code; it is considered part of the text. We use the
C++ programming language for low-level algorithms. However, only a minimal set of features beyond
plain C is used, most importantly classes and templates. For material where technicalities in the C++
code would obscure the underlying ideas we use either pseudocode or, with arithmetical algorithms, the
GP language. Appendix C gives an introduction to GP.

Example computations are often given with an algorithm; these are usually made with the demo programs
referred to. Most of the listings and figures in this book were created with these programs. A recurring
topic is practical efficiency of the implementations. Various optimization techniques are described and
the actual performance of many given implementations is indicated.

The accompanying software, the FXT [21] and the hfloat [22] libraries, are written for POSIX compliant
platforms such as the Linux and BSD operating systems. The license is the GNU General Public License
(GPL), version 3 or later, see http://www.gnu.org/licenses/gpl.html.

Individual chapters are self-contained where possible and references to related material are given where
needed. The symbol ‘‡’ marks sections that can be skipped at first reading. These typically contain
excursions or more advanced material.
treated there must be errors in this book. Corrections and suggestions for improvement are appreciated;
the preferred way of communication is electronic mail. A list of errata is online at
http://www.jjj.de/fxt/#fxtbook
Chapter 1

Bit  wizardry 


We  give  low-level  functions  for  binary  words,  such  as  isolation  of  the  lowest  set  bit  or  counting  all  set 
bits.  Sometimes  the  term  ‘one’  is  used  for  a set  bit  and  ‘zero’  for  an  unset  bit.  Where  it  cannot  cause 
confusion,  the  term  ‘bit’  is  used  for  a set  bit  (as  in  “counting  the  bits  of  a word”). 

The C-type unsigned long is abbreviated as ulong as defined in [FXT: fxttypes.h]. It is assumed that
BITS_PER_LONG reflects the size of an unsigned long. It is defined in [FXT: bits/bitsperlong.h] and
usually equals the machine word size: 32 on 32-bit architectures, and 64 on 64-bit machines. Further,
the quantity BYTES_PER_LONG reflects the number of bytes in a machine word: it equals BITS_PER_LONG
divided by eight. For some functions it is assumed that long and ulong have the same number of bits.

Many functions will only work on machines that use two’s complement, which is used by all of the current
general purpose computers (the only machines using one’s complement appear to be some successors of
the UNIVAC system, see [PM entry “UNIVAC 1100/2200 series”]).

The  examples  of  assembler  code  are  for  the  x86  and  the  AMD64  architecture.  They  should  be  simple 
enough  to  be  understood  by  readers  who  know  assembler  for  any  CPU. 

1.1  Trivia 

1.1.1  Little  endian  versus  big  endian 

The order in which the bytes of an integer are stored in memory can start with the least significant byte
(little endian machine) or with the most significant byte (big endian machine). The hexadecimal number
0x0D0C0B0A will be stored in the following manner if memory addresses grow from left to right:

  adr:  z    z+1  z+2  z+3
  mem:  0D   0C   0B   0A    // big endian
  mem:  0A   0B   0C   0D    // little endian

The difference becomes visible when you cast pointers. Let V be the 32-bit integer with the value
above. Then the result of char c = *(char *)(&V); will be 0x0A (value modulo 256) on a little
endian machine but 0x0D (value divided by 2^24) on a big endian machine. Though friends of big endian
sometimes refer to little endian as ‘wrong endian’, the desired result of the shown pointer cast is much
more often the modulo operation.

Whenever  words  are  serialized  into  bytes,  as  with  transfer  over  a network  or  to  a disk,  one  will  need  two 
code  versions,  one  for  big  endian  and  one  for  little  endian  machines.  The  C-type  union  (with  words  and 
bytes)  may  also  require  separate  treatment  for  big  and  little  endian  architectures. 
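
The byte order can be inspected at run time, as in the following minimal sketch. The helper names (first_byte, is_little_endian, store_be32) are hypothetical and not part of FXT. Extracting bytes by shifts, as store_be32 does, is one possible way to get a single code path for serialization, since shifts operate on values and not on the memory layout.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the byte stored at the lowest address of a 32-bit word.
inline unsigned first_byte(std::uint32_t v)
{
    unsigned char b[sizeof v];
    std::memcpy(b, &v, sizeof v);  // well-defined way to look at the bytes
    return b[0];
}

// True on a little endian machine (hypothetical helper).
inline bool is_little_endian()
{
    return first_byte(0x0D0C0B0AU) == 0x0A;
}

// Write v in big endian order, independent of the byte order of the host.
inline void store_be32(std::uint32_t v, unsigned char *p)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)(v);
}

inline void check_endian()
{
    unsigned char p[4];
    store_be32(0x0D0C0B0AU, p);
    assert( p[0]==0x0D && p[1]==0x0C && p[2]==0x0B && p[3]==0x0A );
}
```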

1.1.2  Size  of  pointer  is  not  size  of  int 

If  programming  for  a 32-bit  architecture  (where  the  size  of  int  and  long  coincide),  casting  pointers  to 
integers  (and  back)  will  usually  work.  The  same  code  will  fail  on  64-bit  machines.  If  you  have  to  cast 
pointers  to  an  integer  type,  cast  them  to  a sufficiently  big  type.  For  portable  code  it  is  better  to  avoid 
casting  pointers  to  integer  types. 
1.1.3  Shifts and division


With two’s complement arithmetic division and multiplication by a power of 2 is a right and left shift,
respectively. This is true for unsigned types and for multiplication (left shift) with signed types. Division
with signed types rounds toward zero, as one would expect, but right shift is a division (by a power of 2)
that rounds to −∞:

  int a = -1;
  int c = a >> 1;  // c == -1
  int d = a / 2;   // d == 0


The compiler still uses a shift instruction for the division, but with a ‘fix’ for negative values:

    9:test.cc      @ int foo(int a)
   10:test.cc      @ {
  285 0003 8B442410    movl 16(%esp),%eax    // move argument to %eax
   11:test.cc      @     int s = a >> 1;
  289 0007 89C1        movl %eax,%ecx
  290 0009 D1F9        sarl $1,%ecx
   12:test.cc      @     int d = a / 2;
  293 000b 89C2        movl %eax,%edx
  294 000d C1EA1F      shrl $31,%edx         // fix: %edx=(%edx<0?1:0)
  295 0010 01D0        addl %edx,%eax        // fix: add one if a<0
  296 0012 D1F8        sarl $1,%eax


For unsigned types the shift would suffice. One more reason to use unsigned types whenever possible.
The assembler listing was generated from C code via the following commands:

  # create assembler code:
  c++ -S -fverbose-asm -g -O2 test.cc -o test.s
  # create asm interlaced with source lines:
  as -alhnd test.s > test.lst


There are two types of right shifts: a logical and an arithmetical shift. The logical version (shrl in the
above fragment) always fills the higher bits with zeros, corresponding to division of unsigned types. The
arithmetical shift (sarl in the above fragment) fills in ones or zeros, according to the most significant bit
of the original word.

Computing  remainders  modulo  a power  of  2 with  unsigned  types  is  equivalent  to  a bit-and: 

  ulong a = b % 32;  // == b & (32-1)


All  of  the  above  is  done  by  the  compiler’s  optimization  wherever  possible. 
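
The rounding behavior and the compiler's ‘fix’ can be checked with a small sketch like the following. The function names are made up for illustration, and the right shift of a negative value is assumed to be arithmetic, as it is with the commonly used compilers.

```cpp
#include <cassert>
#include <climits>

typedef unsigned long ulong;

inline int shift_half(int a)  { return a >> 1; }  // rounds toward -infinity
inline int div_half(int a)    { return a / 2; }   // rounds toward zero

// Division by 2 via shift, with the fix for negative values:
inline int div_half_by_shift(int a)
{
    int fix = (int)( (unsigned)a >> (sizeof(int)*CHAR_BIT - 1) );  // 1 if a<0, else 0
    return (a + fix) >> 1;
}

// Remainder modulo a power of 2 for unsigned types:
inline ulong mod32(ulong b)  { return b & (32 - 1); }  // == b % 32

inline void check_shift_div()
{
    assert( shift_half(-1) == -1 );
    assert( div_half(-1) == 0 );
    assert( div_half_by_shift(-7) == -7 / 2 );
    assert( mod32(1000) == 1000 % 32 );
}
```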

Division  by  (compile  time)  constants  can  be  replaced  by  multiplications  and  shifts.  The  compiler  does  it 
for  you.  A division  by  the  constant  10  is  compiled  to: 

    5:test.cc      @ ulong foo(ulong a)
    6:test.cc      @ {
    7:test.cc      @     ulong b = a / 10;
  290 0000 8B442404    movl 4(%esp),%eax
  291 0004 F7250000    mull .LC33            // value == 0xcccccccd
  292 000a 89D0        movl %edx,%eax
  293 000c C1E803      shrl $3,%eax


Therefore  it  is  sometimes  reasonable  to  have  separate  code  branches  with  explicit  special  values.  Similar 
optimizations  can  be  used  for  the  modulo  operation  if  the  modulus  is  a compile  time  constant.  For 
example,  using  modulus  10,000: 


    8:test.cc      @ ulong foo(ulong a)
    9:test.cc      @ {
   53 0000 8B4C2404    movl 4(%esp),%ecx
   10:test.cc      @     ulong b = a % 10000;
   57 0004 89C8        movl %ecx,%eax
   58 0006 F7250000    mull .LC0             // value == 0xd1b71759
   59 000c 89D0        movl %edx,%eax
   60 000e C1E80D      shrl $13,%eax
   61 0011 69C01027    imull $10000,%eax,%eax
   62 0017 29C1        subl %eax,%ecx
   63 0019 89C8        movl %ecx,%eax


Algorithms to replace divisions by a constant with multiplications and shifts are given in m, see
also [346].
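
The two listings can be mirrored in C++ for 32-bit words as sketched below. The constants and shift amounts are taken from the listings above; the function names div10 and mod10000 are made up for this illustration. The mull instruction leaves the high 32 bits of the product in %edx, which corresponds to a right shift by 32 of the full 64-bit product.

```cpp
#include <cassert>
#include <cstdint>

// a / 10:  high word of a * 0xcccccccd, shifted right by 3.
inline std::uint32_t div10(std::uint32_t a)
{
    std::uint64_t p = (std::uint64_t)a * 0xcccccccdU;  // full 64-bit product
    return (std::uint32_t)( p >> (32 + 3) );
}

// a % 10000:  compute the quotient q, then subtract q * 10000.
inline std::uint32_t mod10000(std::uint32_t a)
{
    std::uint64_t p = (std::uint64_t)a * 0xd1b71759U;
    std::uint32_t q = (std::uint32_t)( p >> (32 + 13) );  // q == a / 10000
    return a - q * 10000U;
}

inline void check_const_div()
{
    assert( div10(123456789U) == 12345678U );
    assert( mod10000(123456789U) == 6789U );
}
```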
Note that the C standard leaves the behavior of a right shift of a signed integer as ‘implementation-defined’.
The described behavior (that a negative value remains negative after right shift) is the default
behavior of many commonly used C compilers.

1.1.4  A pitfall  (two’s  complement) 


   c=................   -c=................
   c=...............1   -c=1111111111111111
   c=..............1.   -c=111111111111111.
   c=..............11   -c=11111111111111.1
   c=.............1..   -c=11111111111111..
   c=.............1.1   -c=1111111111111.11
   c=.............11.   -c=1111111111111.1.
  [--snip--]
   c=.1111111111111.1
   c=.11111111111111.
   c=.111111111111111
  [--snip--]

Figure 1.1-A: With two’s complement there is one nonzero value that is its own negative.


In two’s complement zero is not the only number that is equal to its negative. The value with just
the highest bit set (the most negative value) also has this property. Figure 1.1-A (the output of [FXT:
bits/gotcha-demo.cc]) shows the situation for words of 16 bits. This is why innocent looking code like
the following can simply fail:

  if ( x<0 )  x = -x;
  // assume x positive here (WRONG!)
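
The property can be observed without relying on signed overflow by doing the negation with unsigned arithmetic, as in this sketch. The names neg16 and magnitude are hypothetical; returning the magnitude as an unsigned value is one possible way around the pitfall, not the method of the text.

```cpp
#include <cassert>
#include <cstdint>

// Negation of a 16-bit word, computed modulo 2^16.
inline std::uint16_t neg16(std::uint16_t c)
{
    return (std::uint16_t)( 0U - c );
}

// Magnitude of x as an unsigned value; correct also for the most negative x.
inline std::uint32_t magnitude(std::int32_t x)
{
    std::uint32_t u = (std::uint32_t)x;  // conversion is modulo 2^32
    return ( x < 0 ) ? (0U - u) : u;
}

inline void check_negation()
{
    assert( neg16(0x8000) == 0x8000 );  // nonzero value equal to its negative
    assert( neg16(0) == 0 );
    assert( neg16(1) == 0xFFFF );
}
```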


1.1.5  Another  pitfall  (shifts  in  the  C-language) 

A shift by more than BITS_PER_LONG-1 is undefined by the C-standard. Therefore the following function
can fail if k is zero:

  static inline ulong first_comb(ulong k)
  // Return the first combination of (i.e. smallest word with) k bits,
  // i.e. 00..001111..1 (k low bits set)
  {
      ulong t = ~0UL >> ( BITS_PER_LONG - k );
      return t;
  }

Compilers usually emit just a shift instruction which on certain CPUs does not give zero if the shift is
equal to or greater than BITS_PER_LONG. This is why the line

  if ( k==0 )  t = 0;  // shift with BITS_PER_LONG is undefined

has to be inserted just before the return statement.
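
A self-contained variant is sketched below. It tests for k==0 before shifting, so the undefined shift is never evaluated. The macro BITS_PER_LONG is defined locally here as a stand-in for the one from [FXT: bits/bitsperlong.h].

```cpp
#include <cassert>
#include <climits>

typedef unsigned long ulong;

// Assumption: local stand-in for BITS_PER_LONG of the FXT library.
#define BITS_PER_LONG  ((ulong)(sizeof(ulong) * CHAR_BIT))

static inline ulong first_comb(ulong k)
// Return the smallest word with k bits set, where 0 <= k <= BITS_PER_LONG.
{
    if ( k==0 )  return 0;  // avoid the undefined shift by BITS_PER_LONG
    return ~0UL >> ( BITS_PER_LONG - k );
}

static inline void check_first_comb()
{
    assert( first_comb(0) == 0 );
    assert( first_comb(3) == 7 );
    assert( first_comb(BITS_PER_LONG) == ~0UL );
}
```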


1.1.6  Shortcuts 

Test whether at least one of a and b equals zero with

  if ( !(a && b) )

This works for both signed and unsigned integers. Check whether both are zero with

  if ( (a|b)==0 )

This obviously generalizes for several variables as

  if ( (a|b|c|..|z)==0 )

Test whether exactly one of two variables is zero using

  if ( (!a) ^ (!b) )

1.1.7  Average  without  overflow 

A routine for the computation of the average (x + y)/2 of two arguments x and y is [FXT: bits/average.h]:

  static inline ulong average(ulong x, ulong y)
  // Return floor( (x+y)/2 )
  // Use: x+y == ((x&y)<<1) + (x^y)
  // that is: sum == carries + sum_without_carries
  {
      return (x & y) + ((x ^ y) >> 1);
  }

The function gives the correct value even if (x + y) does not fit into a machine word. If it is known
that x ≥ y, then we can use the simpler statement return y+(x-y)/2. The following version rounds to
infinity:

  static inline ulong ceil_average(ulong x, ulong y)
  // Use: x+y == ((x|y)<<1) - (x^y)
  // ceil_average(x,y) == average(x,y) + ((x^y)&1)
  {
      return (x | y) - ((x ^ y) >> 1);
  }
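
The difference to the straightforward expression shows up for arguments near the largest representable value. The following sketch repeats the two routines and adds naive_average(), a made-up name for the version whose sum can wrap around.

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong average(ulong x, ulong y)
{ return (x & y) + ((x ^ y) >> 1); }

static inline ulong ceil_average(ulong x, ulong y)
{ return (x | y) - ((x ^ y) >> 1); }

// For comparison only: the sum x+y is reduced modulo the word size.
static inline ulong naive_average(ulong x, ulong y)
{ return (x + y) / 2; }

static inline void check_average()
{
    const ulong m = ~0UL;  // largest value
    assert( average(m, m-1) == m-1 );       // floor
    assert( ceil_average(m, m-1) == m );    // ceiling
    assert( naive_average(m, m-1) != m-1 ); // wrong: the sum wrapped around
}
```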

1.1.8  Toggling  between  values 

To toggle an integer x between two values a and b, use:

  pre-calculate:  t = a ^ b;
  toggle:         x ^= t;  // a <--> b

The equivalent trick for floating-point types is

  pre-calculate:  t = a + b;
  toggle:         x = t - x;

Here an overflow could occur with a and b in the allowed range if both are close to overflow.
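
Both variants can be written as small functions, as in this sketch (the function name toggle is chosen for illustration). The values in the floating-point check are picked so that all sums and differences are exact.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Integer version: t must be pre-calculated as a ^ b.
inline ulong toggle(ulong x, ulong t)  { return x ^ t; }

// Floating-point version: t must be pre-calculated as a + b.
inline double toggle(double x, double t)  { return t - x; }

inline void check_toggle()
{
    const ulong a = 5, b = 9, t = a ^ b;
    ulong x = a;
    x = toggle(x, t);  assert( x == b );
    x = toggle(x, t);  assert( x == a );

    const double fa = 1.5, fb = 4.0, ft = fa + fb;
    double fx = fa;
    fx = toggle(fx, ft);  assert( fx == fb );
    fx = toggle(fx, ft);  assert( fx == fa );
}
```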

1.1.9  Next  or  previous  even  or  odd  value 

Compute the next or previous even or odd value via [FXT: bits/evenodd.h]:

  static inline ulong next_even(ulong x)  { return x+2-(x&1); }
  static inline ulong prev_even(ulong x)  { return x-2+(x&1); }

  static inline ulong next_odd(ulong x)   { return x+1+(x&1); }
  static inline ulong prev_odd(ulong x)   { return x-1-(x&1); }

The following functions return the unmodified argument if it has the required property, else the nearest
such value:

  static inline ulong next0_even(ulong x)  { return x+(x&1); }
  static inline ulong prev0_even(ulong x)  { return x-(x&1); }

  static inline ulong next0_odd(ulong x)   { return x+1-(x&1); }
  static inline ulong prev0_odd(ulong x)   { return x-1+(x&1); }

Pedro Gimeno gives [priv. comm.] the following optimized versions:

  static inline ulong next_even(ulong x)  { return (x|1)+1; }
  static inline ulong prev_even(ulong x)  { return (x-1)&~1; }

  static inline ulong next_odd(ulong x)   { return (x+1)|1; }
  static inline ulong prev_odd(ulong x)   { return (x&~1)-1; }

  static inline ulong next0_even(ulong x)  { return (x+1)&~1; }
  static inline ulong prev0_even(ulong x)  { return x&~1; }

  static inline ulong next0_odd(ulong x)   { return x|1; }
  static inline ulong prev0_odd(ulong x)   { return (x-1)|1; }
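
That the two sets of routines agree can be checked mechanically. In the following sketch the namespaces arith and bitop and the function versions_agree() are introduced only for this comparison.

```cpp
#include <cassert>

typedef unsigned long ulong;

namespace arith {  // versions using addition and subtraction
inline ulong next_even(ulong x)   { return x+2-(x&1); }
inline ulong prev_even(ulong x)   { return x-2+(x&1); }
inline ulong next_odd(ulong x)    { return x+1+(x&1); }
inline ulong prev_odd(ulong x)    { return x-1-(x&1); }
inline ulong next0_even(ulong x)  { return x+(x&1); }
inline ulong prev0_even(ulong x)  { return x-(x&1); }
inline ulong next0_odd(ulong x)   { return x+1-(x&1); }
inline ulong prev0_odd(ulong x)   { return x-1+(x&1); }
}

namespace bitop {  // optimized versions
inline ulong next_even(ulong x)   { return (x|1UL)+1; }
inline ulong prev_even(ulong x)   { return (x-1)&~1UL; }
inline ulong next_odd(ulong x)    { return (x+1)|1UL; }
inline ulong prev_odd(ulong x)    { return (x&~1UL)-1; }
inline ulong next0_even(ulong x)  { return (x+1)&~1UL; }
inline ulong prev0_even(ulong x)  { return x&~1UL; }
inline ulong next0_odd(ulong x)   { return x|1UL; }
inline ulong prev0_odd(ulong x)   { return (x-1)|1UL; }
}

// True if all eight pairs of routines give the same value for x.
inline bool versions_agree(ulong x)
{
    return arith::next_even(x)  == bitop::next_even(x)
        && arith::prev_even(x)  == bitop::prev_even(x)
        && arith::next_odd(x)   == bitop::next_odd(x)
        && arith::prev_odd(x)   == bitop::prev_odd(x)
        && arith::next0_even(x) == bitop::next0_even(x)
        && arith::prev0_even(x) == bitop::prev0_even(x)
        && arith::next0_odd(x)  == bitop::next0_odd(x)
        && arith::prev0_odd(x)  == bitop::prev0_odd(x);
}

inline void check_evenodd()
{
    assert( bitop::next_even(6) == 8 && bitop::next_even(7) == 8 );
    assert( bitop::prev0_odd(6) == 5 && bitop::prev0_odd(7) == 7 );
}
```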
1.1.10  Integer versus float multiplication

The floating-point multiplier gives the highest bits of the product. Integer multiplication gives the
result modulo 2^b where b is the number of bits of the integer type used. As an example we square the
number 111111111 using a 32-bit integer type and floating-point types with 24-bit and 53-bit mantissa
(significand):

  a = 111111111                  // assignment
  a*a == 12345678987654321       // true result
  a*a == 1653732529              // result with 32-bit integer multiplication
  (a*a)%(2**32) == 1653732529    // ... which is modulo (2**bits_per_int)
  a*a == 1.2345679481405440e+16  // result with float multiplication (24 bit mantissa)
  a*a == 1.2345678987654320e+16  // result with float multiplication (53 bit mantissa)
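
The integer and double precision lines of the table can be reproduced with a sketch like the following. The function names are made up; the double precision check assumes IEEE 754 arithmetic without excess intermediate precision.

```cpp
#include <cassert>
#include <cstdint>

// Square modulo 2^32, as delivered by 32-bit integer multiplication.
inline std::uint32_t square_mod_2_32(std::uint32_t a)
{
    return (std::uint32_t)( (std::uint64_t)a * a );  // keep the low 32 bits
}

// Square with 53-bit mantissa: the highest bits of the product are kept.
inline double square_double(double a)  { return a * a; }

inline void check_squares()
{
    const std::uint32_t a = 111111111U;
    assert( (std::uint64_t)a * a == 12345678987654321ULL );  // true result
    assert( square_mod_2_32(a) == 1653732529U );             // low bits
    assert( square_double(111111111.0) == 12345678987654320.0 );  // high bits
}
```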

1.1.11  Double  precision  float  to  signed  integer  conversion 

Conversion of double precision floats that have a 53-bit mantissa to signed integers via [m p.52-53]

  #define DOUBLE2INT(i, d)  { double t = ((d) + 6755399441055744.0);  i = *((int *)(&t)); }
  double x = 123.0;
  int i;
  DOUBLE2INT(i, x);

can be a faster alternative to

  double x = 123.0;
  int i = x;

The constant used is 6755399441055744 = 2^52 + 2^51. The method is machine dependent as it relies on the
binary representation of the floating-point mantissa. Here it is assumed that the floating-point number
has a 53-bit mantissa with the most significant bit (that is always one with normalized numbers) omitted,
and that the address of the number points to the mantissa.
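
A variant that does not depend on where the address of the number points is sketched below. It copies the bit pattern into a 64-bit integer and keeps the low 32 bits. The function name double2int is hypothetical; the sketch assumes IEEE 754 double precision, the default rounding mode, and arguments that fit into 32 bits. Note that the addition rounds to the nearest integer, whereas the plain conversion int i = x truncates toward zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

inline std::int32_t double2int(double d)
{
    double t = d + 6755399441055744.0;   // 2^52 + 2^51
    std::uint64_t bits;
    std::memcpy(&bits, &t, sizeof bits);  // bit pattern of t
    // The low 32 bits of the mantissa hold the integer in two's complement:
    return (std::int32_t)(std::uint32_t)bits;
}

inline void check_double2int()
{
    assert( double2int(123.0) == 123 );
    assert( double2int(-5.0) == -5 );
    assert( double2int(2.75) == 3 );  // rounded, not truncated
}
```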

1.1.12  Optimization  considerations 

Never  assume  that  some  code  is  the  ‘fastest  possible’.  There  is  always  another  trick  that  can  still  improve 
performance.  Many  factors  can  have  an  influence  on  performance,  like  the  number  of  CPU  registers  or 
cost  of  branches.  Code  that  performs  well  on  one  machine  might  perform  badly  on  another.  The  old 
trick  to  swap  variables  without  using  a temporary  is  pretty  much  out  of  fashion  today: 


  //            a=0, b=0    a=0, b=1    a=1, b=0    a=1, b=1
  a ^= b;   //    0  0        1  1        1  0        0  1
  b ^= a;   //    0  0        1  0        1  1        0  1
  a ^= b;   //    0  0        1  0        0  1        1  1
  // equivalent to:  tmp = a;  a = b;  b = tmp;

However,  under  some  conditions  (like  extreme  register  pressure)  it  may  be  the  way  to  go.  Note  that  if 
both  operands  are  identical  (memory  locations)  then  the  result  is  zero. 
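
The failure for identical memory locations is easy to guard against, at the price of a comparison. A minimal sketch (the name xor_swap is chosen for illustration):

```cpp
#include <cassert>

typedef unsigned long ulong;

inline void xor_swap(ulong &a, ulong &b)
{
    if ( &a == &b )  return;  // same location: the XORs would leave zero
    a ^= b;
    b ^= a;
    a ^= b;
}

inline void check_xor_swap()
{
    ulong a = 3, b = 5;
    xor_swap(a, b);
    assert( a == 5 && b == 3 );
    xor_swap(a, a);   // no effect due to the guard
    assert( a == 5 );
}
```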

The only way to find out which version of a function is faster is to actually do benchmarking (timing). The
performance does depend on the sequence of instructions surrounding the machine code, assuming that
all of these low-level functions get inlined. Studying the generated CPU instructions helps to understand
what happens, but can never replace benchmarking. This means that benchmarks for just the isolated
routine can at best give a rough indication. Test your application using different versions of the routine
in question.

Never ever delete the unoptimized version of some code fragment when introducing a streamlined one.
Keep the original in the source. If something nasty happens (think of low level software failures when
porting to a different platform), you will be very grateful for the chance to temporarily resort to the slow
but correct version.

Study the optimization recommendations for your CPU (like [11] and [12] for the AMD64, see also [144]).
You can also learn a lot from the documentation for other architectures.

Proper documentation is an absolute must for optimized code. Always assume that nobody will understand
the code without comments. You may not be able to understand uncommented code written by
yourself after enough time has passed.

1.2  Operations  on  individual  bits 

1.2.1  Testing,  setting,  and  deleting  bits 

The following functions should be self-explanatory. Following the spirit of the C language there is no
check whether the indices used are out of bounds. That is, if any index is greater than or equal to
BITS_PER_LONG, the result is undefined [FXT: bits/bittest.h]:

  static inline ulong test_bit(ulong a, ulong i)
  // Return zero if bit[i] is zero,
  // else return one-bit word with bit[i] set.
  {
      return (a & (1UL << i));
  }

The following version returns either zero or one:

  static inline bool test_bit01(ulong a, ulong i)
  // Return whether bit[i] is set.
  {
      return ( 0 != test_bit(a, i) );
  }

Functions for setting, clearing, and changing a bit are:

  static inline ulong set_bit(ulong a, ulong i)
  // Return a with bit[i] set.
  {
      return (a | (1UL << i));
  }

  static inline ulong clear_bit(ulong a, ulong i)
  // Return a with bit[i] cleared.
  {
      return (a & ~(1UL << i));
  }

  static inline ulong change_bit(ulong a, ulong i)
  // Return a with bit[i] changed.
  {
      return (a ^ (1UL << i));
  }

1.2.2  Copying  a bit 

To copy a bit from one position to another, we generate a one if the bits at the two positions differ. Then
an XOR changes the target bit if needed [FXT: bits/bitcopy.h]:

  static inline ulong copy_bit(ulong a, ulong isrc, ulong idst)
  // Copy bit at [isrc] to position [idst].
  // Return the modified word.
  {
      ulong x = ((a>>isrc) ^ (a>>idst)) & 1;  // one if bits differ
      a ^= (x<<idst);  // change if bits differ
      return a;
  }

The situation is more tricky if the bit positions are given as (one bit) masks:

  static inline ulong mask_copy_bit(ulong a, ulong msrc, ulong mdst)
  // Copy bit according to src-mask (msrc)
  // to the bit according to the dest-mask (mdst).
  // Both msrc and mdst must have exactly one bit set.
  {
      ulong x = mdst;
      if ( msrc & a )  x = 0;  // zero if source bit set
      x ^= mdst;   // ==mdst if source bit set, else zero
      a &= ~mdst;  // clear dest bit
      a |= x;
      return a;
  }

The compiler generates branch-free code as the conditional assignment is compiled to a cmov (conditional
move) assembler instruction. If one or both masks have several bits set, the routine will set all bits of
mdst if any of the bits in msrc is one, or else clear all bits of mdst.
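
A short usage sketch for the two routines follows. The routines are repeated so that the fragment is self-contained; the check function and its values are chosen for illustration.

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong copy_bit(ulong a, ulong isrc, ulong idst)
{
    ulong x = ((a>>isrc) ^ (a>>idst)) & 1;  // one if bits differ
    a ^= (x<<idst);  // change if bits differ
    return a;
}

static inline ulong mask_copy_bit(ulong a, ulong msrc, ulong mdst)
{
    ulong x = mdst;
    if ( msrc & a )  x = 0;  // zero if source bit set
    x ^= mdst;   // ==mdst if source bit set, else zero
    a &= ~mdst;  // clear dest bit
    a |= x;
    return a;
}

static inline void check_copy_bit()
{
    // word 4 has only bit 2 set; copying bit 2 to bit 0 gives 5:
    assert( copy_bit(4, 2, 0) == 5 );
    assert( mask_copy_bit(4, 1UL<<2, 1UL<<0) == 5 );
    // word 5 has bit 1 unset; copying bit 1 to bit 0 clears bit 0:
    assert( copy_bit(5, 1, 0) == 4 );
    assert( mask_copy_bit(5, 1UL<<1, 1UL<<0) == 4 );
}
```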


1.2.3  Swapping  two  bits 

A function to swap two bits of a word is [FXT: bits/bitswap.h]:

  static inline ulong bit_swap(ulong a, ulong k1, ulong k2)
  // Return a with bits at positions [k1] and [k2] swapped.
  // k1==k2 is allowed (a is unchanged then)
  {
      ulong x = ((a>>k1) ^ (a>>k2)) & 1;  // one if bits differ
      a ^= (x<<k2);  // change if bits differ
      a ^= (x<<k1);  // change if bits differ
      return a;
  }

If it is known that the bits do have different values, the following routine should be used:

  static inline ulong bit_swap_01(ulong a, ulong k1, ulong k2)
  // Return a with bits at positions [k1] and [k2] swapped.
  // Bits must have different values (!)
  // (i.e. one is zero, the other one)
  // k1==k2 is allowed (a is unchanged then)
  {
      return a ^ ( (1UL<<k1) ^ (1UL<<k2) );
  }
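
The behavior for equal bits, different bits, and identical positions can be checked as in this sketch. The routines are repeated for self-containment; the check function is made up for illustration.

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong bit_swap(ulong a, ulong k1, ulong k2)
{
    ulong x = ((a>>k1) ^ (a>>k2)) & 1;  // one if bits differ
    a ^= (x<<k2);
    a ^= (x<<k1);
    return a;
}

static inline ulong bit_swap_01(ulong a, ulong k1, ulong k2)
{
    return a ^ ( (1UL<<k1) ^ (1UL<<k2) );
}

static inline void check_bit_swap()
{
    // word 6 == binary 110:
    assert( bit_swap(6, 0, 1) == 5 );     // bits differ: 110 --> 101
    assert( bit_swap(6, 1, 2) == 6 );     // bits equal: unchanged
    assert( bit_swap(6, 2, 2) == 6 );     // same position: unchanged
    assert( bit_swap_01(6, 0, 1) == 5 );  // allowed, the bits differ
}
```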


1.3  Operations  on  low  bits  or  blocks  of  a word 


The underlying idea of functions operating on the lowest set bit is that addition and subtraction of 1 always
changes a burst of bits at the lower end of the word. The functions are given in [FXT: bits/bitlow.h].

1.3.1  Isolating, setting, and deleting the lowest one

The lowest one (set bit) is isolated via

  static inline ulong lowest_one(ulong x)
  // Return word where only the lowest set bit in x is set.
  // Return 0 if no bit is set.
  {
      return x & -x;  // use: -x == ~x + 1
  }

The lowest zero (unset bit) is isolated using the equivalent of lowest_one( ~x ):

  static inline ulong lowest_zero(ulong x)
  // Return word where only the lowest unset bit in x is set.
  // Return 0 if all bits are set.
  {
      x = ~x;
      return x & -x;
  }

Alternatively, we can use either of

  return (x ^ (x+1)) & ~x;
  return ((x ^ (x+1)) >> 1 ) + 1;


The sequence of returned values for x = 0, 1, ... is the highest power of 2 that divides x + 1, entry
A006519 in [312] (see also entry A001511):

   x    == x    lowest_zero(x)
   0    ....    ...1
   1    ...1    ..1.
   2    ..1.    ...1
   3    ..11    .1..
   4    .1..    ...1
   5    .1.1    ..1.
   6    .11.    ...1
   7    .111    1...
   8    1...    ...1
   9    1..1    ..1.
  10    1.1.    ...1

The lowest set bit in a word can be cleared by

  static inline ulong clear_lowest_one(ulong x)
  // Return word where the lowest bit set in x is cleared.
  // Return 0 for input == 0.
  {
      return x & (x-1);
  }

The lowest unset bit can be set by

  static inline ulong set_lowest_zero(ulong x)
  // Return word where the lowest unset bit in x is set.
  // Return ~0 for input == ~0.
  {
      return x | (x+1);
  }
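
The four routines of this section are collected in the following sketch together with one typical use: counting the set bits of a word by repeatedly clearing the lowest one. The name count_ones_sparse is hypothetical and not taken from FXT.

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong lowest_one(ulong x)        { return x & -x; }
static inline ulong lowest_zero(ulong x)       { x = ~x;  return x & -x; }
static inline ulong clear_lowest_one(ulong x)  { return x & (x-1); }
static inline ulong set_lowest_zero(ulong x)   { return x | (x+1); }

// Count the set bits: one loop iteration per set bit.
static inline ulong count_ones_sparse(ulong x)
{
    ulong n = 0;
    while ( x )  { x = clear_lowest_one(x);  ++n; }
    return n;
}

static inline void check_lowest()
{
    // 0x34 == binary 00110100
    assert( lowest_one(0x34) == 0x04 );
    assert( clear_lowest_one(0x34) == 0x30 );
    assert( lowest_zero(7) == 8 );          // cf. the table: x=7 --> 1...
    assert( set_lowest_zero(0x0B) == 0x0F );
    assert( count_ones_sparse(0x34) == 3 );
}
```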


1.3.2  Computing  the  index  of  the  lowest  one 

We compute the index (position) of the lowest bit with an assembler instruction if available [FXT:
bits/bitasm-amd64.h]:

  static inline ulong asm_bsf(ulong x)
  // Bit Scan Forward
  {
      asm ("bsfq %0, %0" : "=r" (x) : "0" (x));
      return x;
  }

Without the assembler instruction an algorithm that involves O(log2(BITS_PER_LONG)) operations can be
used. The function can be implemented as follows (suggested by Nathan Bullock [priv. comm.], 64-bit
version) [FXT: bits/bitlow.h]:

  static inline ulong lowest_one_idx(ulong x)
  // Return index of lowest bit set.
  // Examples:
  //    ***1 --> 0
  //    **10 --> 1
  //    *100 --> 2
  // Return 0 (also) if no bit is set.
  {
      ulong r = 0;
      x &= -x;  // isolate lowest bit
      if ( x & 0xffffffff00000000UL )  r += 32;
      if ( x & 0xffff0000ffff0000UL )  r += 16;
      if ( x & 0xff00ff00ff00ff00UL )  r += 8;
      if ( x & 0xf0f0f0f0f0f0f0f0UL )  r += 4;
      if ( x & 0xccccccccccccccccUL )  r += 2;
      if ( x & 0xaaaaaaaaaaaaaaaaUL )  r += 1;
      return r;
  }

The function returns zero for two inputs, one and zero. If a special value for the input zero is needed, a
statement as the following should be added as the first line of the function:

  if ( 1>=x )  return x-1;  // 0 if 1, ~0 if 0
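
The mask-based routine can be compared against a simple loop, as in the sketch below. To keep the 64-bit masks valid on platforms where unsigned long has only 32 bits, the sketch uses a fixed-width type; the names lowest_one_idx64 and lowest_one_idx_loop are made up for this comparison.

```cpp
#include <cassert>
#include <cstdint>

inline unsigned lowest_one_idx64(std::uint64_t x)
{
    unsigned r = 0;
    x &= (0 - x);  // isolate lowest bit
    if ( x & 0xffffffff00000000ULL )  r += 32;
    if ( x & 0xffff0000ffff0000ULL )  r += 16;
    if ( x & 0xff00ff00ff00ff00ULL )  r += 8;
    if ( x & 0xf0f0f0f0f0f0f0f0ULL )  r += 4;
    if ( x & 0xccccccccccccccccULL )  r += 2;
    if ( x & 0xaaaaaaaaaaaaaaaaULL )  r += 1;
    return r;
}

// Reference version: shift right until the lowest bit is set.
inline unsigned lowest_one_idx_loop(std::uint64_t x)
{
    if ( x == 0 )  return 0;
    unsigned r = 0;
    while ( (x & 1) == 0 )  { x >>= 1;  ++r; }
    return r;
}

inline void check_lowest_one_idx()
{
    assert( lowest_one_idx64(12) == 2 );  // binary 1100
    assert( lowest_one_idx64(1) == 0 );
    assert( lowest_one_idx64(0) == 0 );   // same value as for input 1
}
```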

The following function returns the parity of the index of the lowest set bit in a binary word:

  static inline ulong lowest_one_idx_parity(ulong x)
  {
      x &= -x;  // isolate lowest bit
      return 0 != (x & 0xaaaaaaaaaaaaaaaaUL);
  }

The sequence of values for x = 0, 1, 2, ... is

  0010001010100010001000101010001010100010101000100010001010100010...

This is the complement of the period-doubling sequence, entry A035263 in [312]. See section 38.5.1 on
page 735 for the connection to the towers of Hanoi puzzle.

1.3.3  Isolating  blocks  of  zeros  or  ones  at  the  low  end 

Isolate the burst of low ones as follows [FXT: bits/bitlow.h]:

  static inline ulong low_ones(ulong x)
  // Return word where all the (low end) ones are set.
  // Example:  01011011 --> 00000011
  // Return 0 if lowest bit is zero:
  //       10110110 --> 0
  {
      x = ~x;
      x &= -x;
      --x;
      return x;
  }

The isolation of the low zeros is slightly cheaper:

  static inline ulong low_zeros(ulong x)
  // Return word where all the (low end) zeros are set.
  // Example:  01011000 --> 00000111
  // Return 0 if all bits are set.
  {
      x &= -x;
      --x;
      return x;
  }

The lowest block of ones (which may have zeros to the right of it) can be isolated by

  static inline ulong lowest_block(ulong x)
  // Isolate lowest block of ones.
  // e.g.:
  // x   = *****011100
  // l   = 00000000100
  // y   = *****100000
  // x^y = 00000111100
  // ret = 00000011100
  {
      ulong l = x & -x;  // lowest bit
      ulong y = x + l;
      x ^= y;
      return x & (x>>1);
  }
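
The examples given in the comments can be turned into checks, as in this sketch (the routines are repeated for self-containment, the check function is made up for illustration):

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong low_ones(ulong x)
{ x = ~x;  x &= -x;  --x;  return x; }

static inline ulong low_zeros(ulong x)
{ x &= -x;  --x;  return x; }

static inline ulong lowest_block(ulong x)
{
    ulong l = x & -x;  // lowest bit
    ulong y = x + l;   // the carry runs through the lowest block
    x ^= y;
    return x & (x>>1);
}

static inline void check_low_blocks()
{
    assert( low_ones(0x5B) == 0x03 );      // 01011011 --> 00000011
    assert( low_ones(0xB6) == 0 );         // 10110110 --> 0
    assert( low_zeros(0x58) == 0x07 );     // 01011000 --> 00000111
    assert( lowest_block(0x9C) == 0x1C );  // 10011100 --> 00011100
}
```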


1.3.4  Creating  a transition  at  the  lowest  one 

Use the following routines to set a rising or falling edge at the position of the lowest set bit [FXT:
bits/bitlow-edge.h]:

  static inline ulong lowest_one_10edge(ulong x)
  // Return word where all bits from (including) the
  // lowest set bit to most significant bit are set.
  // Return 0 if no bit is set.
  // Example:  00110100 --> 11111100
  {
      return ( x | -x );
  }

  static inline ulong lowest_one_01edge(ulong x)
  // Return word where all bits from (including) the
  // lowest set bit to the least significant are set.
  // Return 0 if no bit is set.
  // Example:  00110100 --> 00000111
  {
      if ( 0==x )  return 0;
      return x^(x-1);
  }


1.3.5  Isolating  the  lowest  run  of  matching  bits 


Let x = *0W and y = *1W; the following function computes W:

  static inline ulong low_match(ulong x, ulong y)
  {
      x ^= y;   // bit-wise difference
      x &= -x;  // lowest bit that differs in both words
      x -= 1;   // mask that covers equal bits at low end
      x &= y;   // isolate matching bits
      return x;
  }
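
As a worked example take W = 110 with x = 1010 0 110 and y = 0111 1 110 (binary). The difference x^y is 11011000, its lowest bit is 1000, subtracting 1 gives the mask 0111, and the AND with y leaves 110. A sketch that checks this (the check function is made up for illustration):

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong low_match(ulong x, ulong y)
{
    x ^= y;   // bit-wise difference
    x &= -x;  // lowest bit that differs in both words
    x -= 1;   // mask that covers equal bits at low end
    x &= y;   // isolate matching bits
    return x;
}

static inline void check_low_match()
{
    // x == 10100110 == 0xA6,  y == 01111110 == 0x7E,  W == 110 == 6
    assert( low_match(0xA6, 0x7E) == 0x06 );
    // lowest bits already differ: W is empty
    assert( low_match(0x00, 0x01) == 0 );
}
```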


1.4  Extraction  of  ones,  zeros,  or  blocks  near  transitions 

We  give  functions  for  the  creation  or  extraction  of  bit-blocks  and  the  isolation  of  values  near  transitions. 
A transition  is  a place  where  adjacent  bits  have  different  values.  A block  is  a group  of  adjacent  bits  of 
the  same  value. 


1.4.1  Creating  blocks  of  ones 

The following functions are given in [FXT: bits/bitblock.h].

  static inline ulong bit_block(ulong p, ulong n)
  // Return word with length-n bit block starting at bit p set.
  // Both p and n are effectively taken modulo BITS_PER_LONG.
  {
      ulong x = (1UL<<n) - 1;
      return x << p;
  }

A version with indices wrapping around is

  static inline ulong cyclic_bit_block(ulong p, ulong n)
  // Return word with length-n bit block starting at bit p set.
  // The result is possibly wrapped around the word boundary.
  // Both p and n are effectively taken modulo BITS_PER_LONG.
  {
      ulong x = (1UL<<n) - 1;
      return (x<<p) | (x>>(BITS_PER_LONG-p));
  }

1.4.2  Finding  isolated  ones  or  zeros 

The following functions are given in [FXT: bits/bit-isolate.h]:

  static inline ulong single_ones(ulong x)
  // Return word with only the isolated ones of x set.
  {
      return x & ~( (x<<1) | (x>>1) );
  }

We can assume a word is embedded in zeros or ignore the bits outside the word:

  static inline ulong single_zeros_xi(ulong x)
  // Return word with only the isolated zeros of x set.
  {
      return single_ones( ~x );  // ignore outside values
  }

  static inline ulong single_zeros(ulong x)
  // Return word with only the isolated zeros of x set.
  {
      return ~x & ( (x<<1) & (x>>1) );  // assume outside values == 0
  }

  static inline ulong single_values(ulong x)
  // Return word where only the isolated ones and zeros of x are set.
  {
      return (x ^ (x<<1)) & (x ^ (x>>1));  // assume outside values == 0
  }

  static inline ulong single_values_xi(ulong x)
  // Return word where only the isolated ones and zeros of x are set.
  {
      return single_ones(x) | single_zeros_xi(x);  // ignore outside values
  }
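
For the word x = 01011011 (binary) the only isolated one is at index 6 and the isolated zeros are at indices 2 and 5. The following sketch checks this for the versions that assume zeros outside the word; the check function is made up for illustration.

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong single_ones(ulong x)
{ return x & ~( (x<<1) | (x>>1) ); }

static inline ulong single_zeros(ulong x)
{ return ~x & ( (x<<1) & (x>>1) ); }  // assume outside values == 0

static inline ulong single_values(ulong x)
{ return (x ^ (x<<1)) & (x ^ (x>>1)); }  // assume outside values == 0

static inline void check_single()
{
    const ulong x = 0x5B;                  // 01011011
    assert( single_ones(x)   == 0x40 );    // 01000000
    assert( single_zeros(x)  == 0x24 );    // 00100100
    assert( single_values(x) == 0x64 );    // 01100100
}
```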


1.4.3  Isolating  single  ones  or  zeros  at  the  word  boundary 

  static inline ulong border_ones(ulong x)
  // Return word where only those ones of x are set that lie next to a zero.
  {
      return x & ~( (x<<1) & (x>>1) );
  }

  static inline ulong border_values(ulong x)
  // Return word where those bits of x are set that lie on a transition.
  {
      return (x ^ (x<<1)) | (x ^ (x>>1));
  }


1.4.4  Isolating  transitions 

static inline ulong high_border_ones(ulong x)
// Return word where only those ones of x are set
// that lie right to (i.e. in the next lower bin of) a zero.
{
    return x & ( x ^ (x>>1) );
}

static inline ulong low_border_ones(ulong x)
// Return word where only those ones of x are set
// that lie left to (i.e. in the next higher bin of) a zero.
{
    return x & ( x ^ (x<<1) );
}


1.4.5  Isolating  ones  or  zeros  at  block  boundaries 

static inline ulong block_border_ones(ulong x)
// Return word where only those ones of x are set
// that are at the border of a block of at least 2 bits.
{
    return x & ( (x<<1) ^ (x>>1) );
}

static inline ulong low_block_border_ones(ulong x)
// Return word where only those bits of x are set
// that are at left of a border of a block of at least 2 bits.
{
    ulong t = x & ( (x<<1) ^ (x>>1) );  // block_border_ones()
    return t & (x>>1);
}

static inline ulong high_block_border_ones(ulong x)
// Return word where only those bits of x are set
// that are at right of a border of a block of at least 2 bits.
{
    ulong t = x & ( (x<<1) ^ (x>>1) );  // block_border_ones()
    return t & (x<<1);
}

static inline ulong block_ones(ulong x)
// Return word where only those bits of x are set
// that are part of a block of at least 2 bits.
{
    return x & ( (x<<1) | (x>>1) );
}
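The border routines of the preceding sections can be compared on one input. The sketch below repeats five of them so that it is self-contained and applies them to x = 0x77, which reads .111.111 in the low eight bits: two blocks of three ones each.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Ones that have at least one zero neighbor.
static inline ulong border_ones(ulong x)       { return x & ~((x << 1) & (x >> 1)); }

// Ones whose next higher neighbor is zero (highest bit of each block).
static inline ulong high_border_ones(ulong x)  { return x & (x ^ (x >> 1)); }

// Ones whose next lower neighbor is zero (lowest bit of each block).
static inline ulong low_border_ones(ulong x)   { return x & (x ^ (x << 1)); }

// Ones with exactly one neighboring one: ends of blocks of length >= 2.
static inline ulong block_border_ones(ulong x) { return x & ((x << 1) ^ (x >> 1)); }

// Ones with at least one neighboring one: isolated ones are dropped.
static inline ulong block_ones(ulong x)        { return x & ((x << 1) | (x >> 1)); }
```

For 0x77 both border_ones() and block_border_ones() give 0x55 (bits 0, 2, 4, 6), while high_border_ones() gives 0x44 and low_border_ones() gives 0x11. Adding an isolated one at bit 8 (x = 0x177) shows the effect of block_ones(): the isolated bit is removed and 0x77 remains.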


1.5 Computing the index of a single set bit

In the function lowest_one_idx() given in section 1.3.2 on page 9 we first isolated the lowest one of a word x by setting x&=-x. At this point, x contains just one set bit (or x==0). The remaining lines of the routine compute the index of that bit. This section gives some alternative techniques to compute the index of the one in a single-bit word.


1.5.1  Cohen’s  trick 


modulus m=11
    k     =  0  1  2  3  4  5  6  7  8  9 10
    mt[k] =  0  0  1  8  2  4  9  7  3  6  5
Lowest bit == 0:  x= .......1 =   1   x % m=  1  ==>  lookup = 0
Lowest bit == 1:  x= ......1. =   2   x % m=  2  ==>  lookup = 1
Lowest bit == 2:  x= .....1.. =   4   x % m=  4  ==>  lookup = 2
Lowest bit == 3:  x= ....1... =   8   x % m=  8  ==>  lookup = 3
Lowest bit == 4:  x= ...1.... =  16   x % m=  5  ==>  lookup = 4
Lowest bit == 5:  x= ..1..... =  32   x % m= 10  ==>  lookup = 5
Lowest bit == 6:  x= .1...... =  64   x % m=  9  ==>  lookup = 6
Lowest bit == 7:  x= 1....... = 128   x % m=  7  ==>  lookup = 7

Figure 1.5-A: Determination of the position of a single bit with 8-bit words.


A nice trick is presented in [110]: for N-bit words find a number m such that all powers of 2 are different modulo m. That is, the (multiplicative) order of 2 modulo m must be greater than or equal to N. We use a table mt[] of size m that contains the exponents of the powers of 2: mt[(2**j) mod m] = j for j >= 0. To look up the index of a one-bit word x, the word is reduced modulo m and mt[x] is returned.

We demonstrate the method for N = 8 where m = 11 is the smallest number with the required property. The setup routine for the table is


const ulong m = 11;  // the modulus
ulong mt[m+1];
static void mt_setup()
{
    mt[0] = 0;  // special value for the zero word
    ulong t = 1;
    for (ulong i=1; i<m; ++i)
    {
        mt[t] = i-1;
        t *= 2;
        if ( t>=m )  t -= m;  // modular reduction
    }
}

The  entry  in  mt  [0]  will  be  accessed  when  the  input  is  the  zero  word.  We  can  use  any  value  to  be  returned 
for  input  zero.  Here  we  simply  use  zero  to  always  have  the  same  return  value  as  with  lowest_one_idx() . 
The  index  can  be  computed  by 


static inline ulong m_lowest_one_idx(ulong x)
{
    x &= -x;  // isolate lowest bit
    x %= m;   // power of 2 modulo m
    return mt[x];  // lookup
}
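The same scheme carries over to 64-bit words. The following is a minimal sketch, assuming the modulus 67 that this section lists for N = 64; the names M, mt64, mt64_setup, m64_lowest_one_idx, and m64_check are chosen for this illustration and do not appear in FXT.

```cpp
#include <cassert>
#include <cstdint>

static const uint64_t M = 67;   // assumed modulus for 64-bit words
static unsigned char mt64[M];   // mt64[(2**j) mod M] = j

static void mt64_setup()
{
    for (uint64_t k = 0; k < M; ++k)  mt64[k] = 0;  // entry 0 serves the zero word
    uint64_t t = 1;
    for (unsigned j = 0; j < 64; ++j)
    {
        mt64[t] = (unsigned char)j;
        t = (2 * t) % M;        // next power of 2 modulo M
    }
}

static inline unsigned m64_lowest_one_idx(uint64_t x)
{
    x &= (~x + 1);              // isolate lowest bit (same as x &= -x)
    return mt64[x % M];         // reduce modulo M, then look up
}

// Self-check over all 64 single-bit words and the zero word.
static bool m64_check()
{
    mt64_setup();
    for (unsigned j = 0; j < 64; ++j)
        if (m64_lowest_one_idx(uint64_t(1) << j) != j)  return false;
    return m64_lowest_one_idx(0) == 0;
}
```

The table has only 67 one-byte entries, but each lookup needs a division by a constant, which a compiler usually turns into a multiplication and shifts.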


The code is given in the program [FXT: bits/modular-lookup-demo.cc], the output with N = 8 (edited for size) is shown in figure 1.5-A. The following moduli m(N) can be used for N-bit words:

    N:    8   16   32   64   128   256   512   1024
    m:   11   19   37   67   131   269   523   1061

The modulus m(N) is the smallest prime greater than N such that 2 is a primitive root modulo m(N).

Figure 1.5-B: Computing the position of the single set bit in 8-bit words with a De Bruijn sequence.


1.5.2  Using  De  Bruijn  sequences 


The following method (given in [228]) is even more elegant. It uses binary De Bruijn sequences of size N. A binary De Bruijn sequence of length 2^N contains all binary words of length N, see section 41.1 on page 864. These are the sequences for 32 and 64 bit, as binary words:

#if BITS_PER_LONG == 32
const ulong db = 0x4653ADFUL;
// == 00000100011001010011101011011111
const ulong s = 32-5;
#else
const ulong db = 0x218A392CD3D5DBFUL;
// == 0000001000011000101000111001001011001101001111010101110110111111
const ulong s = 64-6;
#endif


Let w_i be the i-th sub-word from the left (high end). We create a table such that the entry with index w_i points to i:

ulong dbt[BITS_PER_LONG];
static void dbt_setup()
{
    for (ulong i=0; i<BITS_PER_LONG; ++i)  dbt[ (db<<i)>>s ] = i;
}


The  computation  of  the  index  involves  a multiplication  and  a table  lookup: 


static inline ulong db_lowest_one_idx(ulong x)
{
    x &= -x;  // isolate lowest bit
    x *= db;  // multiplication by a power of 2 is a shift
    x >>= s;  // use log_2(BITS_PER_LONG) highest bits
    return dbt[x];  // lookup
}
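The pieces can be put together as follows. This is a sketch for 64-bit words only, using the 64-bit sequence quoted in this section with fixed-width types; the names DB, S, dbt64, and the check routine are chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>

static const uint64_t DB = 0x218A392CD3D5DBFULL;  // 64-bit De Bruijn sequence
static const unsigned S = 64 - 6;                 // keep the 6 highest bits
static unsigned char dbt64[64];

static void dbt64_setup()
{
    // The i-th 6-bit sub-word (from the high end) points to i.
    for (unsigned i = 0; i < 64; ++i)  dbt64[(DB << i) >> S] = (unsigned char)i;
}

static inline unsigned db64_lowest_one_idx(uint64_t x)
{
    x &= (~x + 1);   // isolate lowest bit (same as x &= -x)
    x *= DB;         // for a single-bit word this equals DB << idx
    x >>= S;         // the 6 highest bits select the table entry
    return dbt64[x];
}

// Self-check: every single-bit word must map back to its index.
static bool db64_check()
{
    dbt64_setup();
    for (unsigned i = 0; i < 64; ++i)
        if (db64_lowest_one_idx(uint64_t(1) << i) != i)  return false;
    return true;
}
```

For the zero word the product is zero, so the routine returns the entry dbt64[0], which is 0.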


The sequences used must start with at least log_2(N) - 1 zeros because in the line x *= db the word x is shifted (not rotated). The code is given in the demo [FXT: bits/debruijn-lookup-demo.cc], the output with N = 8 (edited for size, dots denote zeros) is shown in figure 1.5-B.


1.5.3  Using  floating-point  numbers 

Floating-point  numbers  are  normalized  so  that  the  highest  bit  in  the  mantissa  is  set.  Therefore  if  we 
convert  an  integer  into  a float,  the  position  of  the  highest  set  bit  can  be  read  off  the  exponent.  By  isolating 
the  lowest  bit  before  that  operation,  the  index  can  be  found  with  the  same  trick.  However,  the  conversion 
between  integers  and  floats  is  usually  slow.  Further,  the  technique  is  highly  machine  dependent. 
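A minimal sketch of the idea follows. Instead of reading the exponent field from the bit pattern of the float, it calls std::ilogb() from the standard header cmath, which returns the binary exponent without assumptions about the memory layout. The routine name fp_lowest_one_idx is chosen for this illustration, and binary (radix 2) floating-point numbers are assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

static inline unsigned fp_lowest_one_idx(uint64_t x)
{
    if (x == 0)  return 0;           // same convention as lowest_one_idx()
    x &= (~x + 1);                   // isolate lowest bit: x is now a power of 2
    double d = (double)x;            // exact: powers of 2 up to 2**63 fit a double
    return (unsigned)std::ilogb(d);  // the binary exponent is the bit index
}
```

Because the word holds a single bit after the isolation step, the conversion to double is exact even though a double has only 53 mantissa bits.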


1.6  Operations  on  high  bits  or  blocks  of  a word 

For  functions  operating  on  the  highest  bit  there  is  no  method  as  trivial  as  shown  for  the  lower  end  of  the 
word.  With  a bit-reverse  CPU-instruction  available  life  would  be  significantly  easier.  However,  almost 
no  CPU  seems  to  have  it. 


1.6.1 Isolating the highest one and finding its index


[Figure 1.6-A: for each of the two input words 0xf0f7 and 0xffff0f08 the figure lists the results of highest_one, highest_one_01edge, highest_one_10edge, highest_one_idx, low_zeros, low_ones, lowest_one, lowest_one_01edge, lowest_one_10edge, lowest_one_idx, lowest_block, clear_lowest_one, lowest_zero, set_lowest_zero, high_ones, high_zeros, highest_zero, and set_highest_zero. The bit patterns are not reproduced here.]

Figure 1.6-A: Operations on the highest and lowest bits (and blocks) of a binary word for two different 32-bit input words. Dots denote zeros.


Isolation of the highest set bit is easy if a bit-scan instruction is available [FXT: bits/bitasm-i386.h]:

static inline ulong asm_bsr(ulong x)
// Bit Scan Reverse
{
    asm ("bsrl %0, %0" : "=r" (x) : "0" (x));
    return x;
}

Without a bit-scan instruction, we use the auxiliary function [FXT: bits/bithigh-edge.h]:

static inline ulong highest_one_01edge(ulong x)
// Return word where all bits from (including) the
// highest set bit to bit 0 are set.
// Return 0 if no bit is set.
{
    x |= x>>1;
    x |= x>>2;
    x |= x>>4;
    x |= x>>8;
    x |= x>>16;
#if BITS_PER_LONG >= 64
    x |= x>>32;
#endif
    return x;
}

The resulting code is [FXT: bits/bithigh.h]:

static inline ulong highest_one(ulong x)
// Return word where only the highest bit in x is set.
// Return 0 if no bit is set.
{
#if defined BITS_USE_ASM
    if ( 0==x )  return 0;
    x = asm_bsr(x);
    return 1UL<<x;
#else
    x = highest_one_01edge(x);
    return x ^ (x>>1);
#endif  // BITS_USE_ASM
}
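The non-assembler path can be traced on a small input. The sketch below fixes the word size at 64 bits (the names carry the suffix 64 only to mark this assumption) and repeats both routines so that it compiles on its own.

```cpp
#include <cassert>
#include <cstdint>

// Smear the highest set bit downwards: all bits from it to bit 0 get set.
static inline uint64_t highest_one_01edge64(uint64_t x)
{
    x |= x >> 1;   x |= x >> 2;   x |= x >> 4;
    x |= x >> 8;   x |= x >> 16;  x |= x >> 32;
    return x;
}

// XOR with the word shifted right by one leaves only the top bit of the smear.
static inline uint64_t highest_one64(uint64_t x)
{
    x = highest_one_01edge64(x);
    return x ^ (x >> 1);
}
```

For x = 0x50 (binary 1010000) the smeared word is 0x7f and the result is 0x7f ^ 0x3f = 0x40.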


To determine the index of the highest set bit, use

static inline ulong highest_one_idx(ulong x)
// Return index of highest bit set.
// Return 0 if no bit is set.
{
#if defined BITS_USE_ASM
    return asm_bsr(x);
#else  // BITS_USE_ASM

    if ( 0==x )  return 0;

    ulong r = 0;
#if BITS_PER_LONG >= 64
    if ( x & 0xffffffff00000000UL )  { x >>= 32;  r += 32; }
#endif
    if ( x & 0xffff0000UL )  { x >>= 16;  r += 16; }
    if ( x & 0x0000ff00UL )  { x >>=  8;  r +=  8; }
    if ( x & 0x000000f0UL )  { x >>=  4;  r +=  4; }
    if ( x & 0x0000000cUL )  { x >>=  2;  r +=  2; }
    if ( x & 0x00000002UL )  {            r +=  1; }
    return r;
#endif  // BITS_USE_ASM
}


The branches in the non-assembler part of the routine can be avoided by a technique given in [215, rel.96, sect.7.1.3] (version for 64-bit words):

static inline ulong highest_one_idx(ulong x)
{
#define MU0 0x5555555555555555UL  // MU0 == ((-1UL)/3UL) == ...01010101_2
#define MU1 0x3333333333333333UL  // MU1 == ((-1UL)/5UL) == ...00110011_2
#define MU2 0x0f0f0f0f0f0f0f0fUL  // MU2 == ((-1UL)/17UL) == ...00001111_2
#define MU3 0x00ff00ff00ff00ffUL  // MU3 == ((-1UL)/257UL) == (8 ones)
#define MU4 0x0000ffff0000ffffUL  // MU4 == ((-1UL)/65537UL) == (16 ones)
#define MU5 0x00000000ffffffffUL  // MU5 == ((-1UL)/4294967297UL) == (32 ones)
    ulong r = ld_neq(x, x & MU0)
        + (ld_neq(x, x & MU1) << 1)
        + (ld_neq(x, x & MU2) << 2)
        + (ld_neq(x, x & MU3) << 3)
        + (ld_neq(x, x & MU4) << 4)
        + (ld_neq(x, x & MU5) << 5);
    return r;
}


The auxiliary function ld_neq() is given in [FXT: bits/bitldeq.h]:

static inline bool ld_neq(ulong x, ulong y)
// Return whether floor(log2(x))!=floor(log2(y))
{ return ( (x^y) > (x&y) ); }


The following version for 64-bit words provided by Sebastiano Vigna [priv. comm.] is an implementation of Brodal's algorithm [215, alg.B, sect.7.1.3]:

static inline ulong highest_one_idx(ulong x)
{
    if ( x == 0 )  return 0;
    ulong r = 0;
    if ( x & 0xffffffff00000000UL )  { x >>= 32;  r += 32; }
    if ( x & 0xffff0000UL )  { x >>= 16;  r += 16; }
    x |= (x << 16);
    x |= (x << 32);
    const ulong y = x & 0xff00f0f0ccccaaaaUL;
    const ulong z = 0x8000800080008000UL;
    ulong t = z & ( y | (( y | z ) - ( x ^ y )));
    t |= (t << 15);
    t |= (t << 30);
    t |= (t << 60);
    return r + ( t >> 60 );
}


1.6.2 Isolating the highest block of ones or zeros


Isolate  the  left  block  of  zeros  with  the  function 


static inline ulong high_zeros(ulong x)
// Return word where all the (high end) zeros are set.
// e.g.:  00011001 --> 11100000
// Returns 0 if highest bit is set:
//        11011001 --> 00000000
{
    x |= x>>1;
    x |= x>>2;
    x |= x>>4;
    x |= x>>8;
    x |= x>>16;
#if BITS_PER_LONG >= 64
    x |= x>>32;
#endif
    return ~x;
}


The  left  block  of  ones  can  be  isolated  using  arithmetical  right  shifts: 

static inline ulong high_ones(ulong x)
// Return word where all the (high end) ones are set.
// e.g.   11001011 --> 11000000
// Returns 0 if highest bit is zero:
//        01110110 --> 00000000
{
    long y = (long)x;  // signed, so that >> is an arithmetical shift
    y &= y>>1;
    y &= y>>2;
    y &= y>>4;
    y &= y>>8;
    y &= y>>16;
#if BITS_PER_LONG >= 64
    y &= y>>32;
#endif
    return (ulong)y;
}


If  arithmetical  shifts  are  more  expensive  than  unsigned  shifts,  use 

static inline ulong high_ones(ulong x)  { return high_zeros( ~x ); }

A demonstration of selected functions operating on the highest or lowest bit (or block) of binary words is given in [FXT: bits/bithilo-demo.cc]. Part of its output is shown in figure 1.6-A.


1.7  Functions  related  to  the  base-2  logarithm 

The following functions are given in [FXT: bits/bit2pow.h]. A function that returns ⌊log_2(x)⌋ can be implemented using the obvious algorithm:

static inline ulong ld(ulong x)
// Return floor(log2(x)),
// i.e. return k so that 2^k <= x < 2^(k+1)
// If x==0, then 0 is returned (!)
{
    ulong k = 0;
    while ( x>>=1 )  { ++k; }
    return k;
}

The result is the same as returned by highest_one_idx():

static inline ulong ld(ulong x)  { return highest_one_idx(x); }


The  bit-wise  algorithm  can  be  faster  if  the  average  result  is  known  to  be  small. 

Use  the  function  one_bit_q()  to  determine  whether  its  argument  is  a power  of  2: 

static  inline  bool  one_bit_q(ulong  x) 

//  Return  whether  x \in  {1,2,4,8,16,...} 

{ 

ulong  m = x-1; 

    return (((x^m)>>1) == m);

} 

The  following  function  does  the  same  except  that  it  returns  true  also  for  the  zero  argument: 

static inline bool is_pow_of_2(ulong x)
// Return whether x == 0(!) or x == 2**k
{ return !(x & (x-1)); }

With FFTs, where the length of the transform is often restricted to powers of 2, the following functions are useful:


static inline ulong next_pow_of_2(ulong x)
// Return x if x=2**k
// else return 2**ceil(log_2(x))
// Exception: returns 0 for x==0
{
    if ( is_pow_of_2(x) )  return x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
#if BITS_PER_LONG == 64
    x |= x >> 32;
#endif
    return x + 1;
}


static inline ulong next_exp_of_2(ulong x)
// Return k if x=2**k else return k+1.
// Exception: returns 0 for x==0.
{
    if ( x <= 1 )  return 0;
    return ld(x-1) + 1;
}

The following version should be faster if inline assembler is used for ld():

static inline ulong next_pow_of_2(ulong x)
{
    if ( is_pow_of_2(x) )  return x;
    ulong n = 1UL<<ld(x);  // n<x
    return n<<1;
}
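A few sample values make the rounding behavior concrete. The sketch below restates the routines for 64-bit words with fixed-width types so that it is self-contained; the suffix 64 in the names marks this assumption.

```cpp
#include <cassert>
#include <cstdint>

static inline bool is_pow_of_2_64(uint64_t x)  { return !(x & (x - 1)); }

// floor(log2(x)) by the bit-wise loop; returns 0 for x == 0.
static inline unsigned ld64(uint64_t x)
{
    unsigned k = 0;
    while (x >>= 1)  { ++k; }
    return k;
}

static inline uint64_t next_pow_of_2_64(uint64_t x)
{
    if (is_pow_of_2_64(x))  return x;
    x |= x >> 1;   x |= x >> 2;   x |= x >> 4;
    x |= x >> 8;   x |= x >> 16;  x |= x >> 32;   // all bits below the highest one set
    return x + 1;                                 // carry into the next higher bit
}

static inline unsigned next_exp_of_2_64(uint64_t x)
{
    if (x <= 1)  return 0;
    return ld64(x - 1) + 1;
}
```

The value 5 is rounded up to 8 (exponent 3), while 8 is returned unchanged. For arguments above 2**63 the next power of 2 does not fit into the word: the smeared word consists of ones only and the final increment wraps around to zero.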


The following routine for comparison of base-2 logarithms without actually computing them is suggested by [215, rel.58, sect.7.1.3] [FXT: bits/bitldeq.h]:

static inline bool ld_eq(ulong x, ulong y)
// Return whether floor(log2(x))==floor(log2(y))
{ return ( (x^y) <= (x&y) ); }
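The comparison works because of the highest set bits. If x and y have their highest one at the same position, that bit survives in x&y and cancels in x^y, so x^y is smaller. If the positions differ, the higher of the two bits survives in x^y only, so x^y is greater. The sketch below checks this against the loop-based ld() for small nonzero arguments; the routine ld_eq_check is a hypothetical helper written for this illustration.

```cpp
#include <cassert>

typedef unsigned long ulong;

// floor(log2(x)) by the bit-wise loop; returns 0 for x == 0.
static inline ulong ld(ulong x)
{
    ulong k = 0;
    while (x >>= 1)  { ++k; }
    return k;
}

static inline bool ld_eq(ulong x, ulong y)   { return (x ^ y) <= (x & y); }
static inline bool ld_neq(ulong x, ulong y)  { return (x ^ y) > (x & y); }

// Compare with the explicit logarithms for all 1 <= x, y <= limit.
static bool ld_eq_check(ulong limit)
{
    for (ulong x = 1; x <= limit; ++x)
        for (ulong y = 1; y <= limit; ++y)
        {
            const bool same = (ld(x) == ld(y));
            if (ld_eq(x, y) != same)  return false;
            if (ld_neq(x, y) == same)  return false;
        }
    return true;
}
```

Zero is left out of the check on purpose: ld(0) is defined as 0, the same value as ld(1), but ld_eq(0, 1) is false.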


1.8  Counting  the  bits  and  blocks  of  a word 


The following functions count the ones in a binary word. They need O(log_2(BITS_PER_LONG)) operations. We give mostly the 64-bit versions [FXT: bits/bitcount.h]:

static inline ulong bit_count(ulong x)
// Return number of bits set
{
    x = (0x5555555555555555UL & x) + (0x5555555555555555UL & (x>> 1));  // 0-2 in 2 bits
    x = (0x3333333333333333UL & x) + (0x3333333333333333UL & (x>> 2));  // 0-4 in 4 bits
    x = (0x0f0f0f0f0f0f0f0fUL & x) + (0x0f0f0f0f0f0f0f0fUL & (x>> 4));  // 0-8 in 8 bits
    x = (0x00ff00ff00ff00ffUL & x) + (0x00ff00ff00ff00ffUL & (x>> 8));  // 0-16 in 16 bits
    x = (0x0000ffff0000ffffUL & x) + (0x0000ffff0000ffffUL & (x>>16));  // 0-32 in 32 bits
    x = (0x00000000ffffffffUL & x) + (0x00000000ffffffffUL & (x>>32));  // 0-64 in 64 bits
    return x;
}


The  underlying  idea  is  to  do  a search  via  bit  masks.  The  code  can  be  improved  to  either 


    x = ((x>>1) & 0x5555555555555555UL) + (x & 0x5555555555555555UL);  // 0-2 in 2 bits
    x = ((x>>2) & 0x3333333333333333UL) + (x & 0x3333333333333333UL);  // 0-4 in 4 bits
    x = ((x>>4) + x) & 0x0f0f0f0f0f0f0f0fUL;  // 0-8 in 8 bits
    x += x>> 8;  // 0-16 in 8 bits
    x += x>>16;  // 0-32 in 8 bits
    x += x>>32;  // 0-64 in 8 bits
    return x & 0xff;

or (taken from [10])

    x -= (x>>1) & 0x5555555555555555UL;  // 0-2 in 2 bits
    x = ((x>>2) & 0x3333333333333333UL) + (x & 0x3333333333333333UL);  // 0-4 in 4 bits
    x = ((x>>4) + x) & 0x0f0f0f0f0f0f0f0fUL;  // 0-8 in 8 bits
    x *= 0x0101010101010101UL;
    return x>>56;


Which  of  the  latter  two  versions  is  faster  mainly  depends  on  the  speed  of  integer  multiplication. 
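The multiplication in the last version deserves a short explanation. After the first three steps each byte of x holds the bit count of that byte, a value between 0 and 8. Multiplying by the word with a one in every byte adds up all bytes in the highest byte of the product. No carries disturb the result because even the largest partial sum (64) fits into a byte. The following sketch wraps the version into a function for 64-bit words and adds a simple reference routine; the names bit_count_mul and bit_count_ref are chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>

static inline unsigned bit_count_mul(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;   // 2-bit field ab: (2a+b) - a = a+b
    x = ((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0fULL;
    x *= 0x0101010101010101ULL;              // highest byte = sum of all bytes
    return (unsigned)(x >> 56);
}

// Reference: clear the lowest set bit until the word is zero.
static inline unsigned bit_count_ref(uint64_t x)
{
    unsigned n = 0;
    while (x)  { ++n;  x &= (x - 1); }
    return n;
}
```

For the word 0xf0f7 both routines return 11 (four ones in each of the two nibbles f and three in the nibble 7).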

The following code for 32-bit words (given by Johan Rönnblom [priv. comm.]) may be advantageous if loading constants is expensive. Note that some constants are in octal notation:

static inline uint CountBits32(uint a)
{
    uint mask = 011111111111UL;
    a = (a - ((a&~mask)>>1)) - ((a>>2)&mask);
    a += a>>3;
    a = (a & 070707) + ((a>>18) & 070707);
    a *= 010101;
    return ((a>>12) & 0x3f);
}

If a table tab[] holds the bit-counts of the numbers 0...255, then the bits can be counted as follows:

ulong bit_count(ulong x)
{
    unsigned char ct = 0;
    ct += tab[ x & 0xff ];  x >>= 8;
    ct += tab[ x & 0xff ];  x >>= 8;
    [--snip--]  /* BYTES_PER_LONG times */
    ct += tab[ x & 0xff ];
    return ct;
}

However,  while  table  driven  methods  tend  to  excel  in  synthetic  benchmarks,  they  can  be  very  slow  if  they 
cause  cache  misses. 

We  give  a method  to  count  the  bits  of  a word  of  a special  form: 

static inline ulong bit_count_01(ulong x)
// Return number of bits in a word
// for words of the special form 00...0001...11
{
    ulong ct = 0;
    ulong a;
#if BITS_PER_LONG == 64
    a = (x & (1UL<<32)) >> (32-5);  // test bit 32
    x >>= a;  ct += a;
#endif
    a = (x & (1UL<<16)) >> (16-4);  // test bit 16
    x >>= a;  ct += a;

    a = (x & (1UL<<8)) >> (8-3);  // test bit 8
    x >>= a;  ct += a;

    a = (x & (1UL<<4)) >> (4-2);  // test bit 4
    x >>= a;  ct += a;

    a = (x & (1UL<<2)) >> (2-1);  // test bit 2
    x >>= a;  ct += a;

    a = (x & (1UL<<1)) >> (1-0);  // test bit 1
    x >>= a;  ct += a;

    ct += x & 1;  // test bit 0

    return ct;
}

All branches are avoided, so the code may be useful on a planet with pink air; for further details see [301].


1.8.1  Sparse  counting 

If  the  (average  input)  word  is  known  to  have  only  a few  bits  set,  the  following  sparse  count  variant  can 
be  advantageous: 

static inline ulong bit_count_sparse(ulong x)
// Return number of bits set.
{
    ulong n = 0;
    while ( x )  { ++n;  x &= (x-1); }
    return n;
}

The  loop  will  execute  once  for  each  set  bit.  Partial  unrolling  of  the  loop  should  be  an  improvement  for 
most  cases: 

    ulong n = 0;
    do
    {
        n += (x!=0);  x &= (x-1);
        n += (x!=0);  x &= (x-1);
        n += (x!=0);  x &= (x-1);
        n += (x!=0);  x &= (x-1);
    }
    while ( x );
    return n;

If the number of bits is close to the maximum, use the given routine with the complement:

static inline ulong bit_count_dense(ulong x)
// Return number of bits set.
// The loop (of bit_count_sparse()) will execute once for
// each unset bit (i.e. zero) of x.
{
    return BITS_PER_LONG - bit_count_sparse( ~x );
}

If the number of ones is guaranteed to be less than 16, then the following routine (suggested by Gunther Piez [priv. comm.]) can be used:

static inline ulong bit_count_15(ulong x)
// Return number of set bits, must have at most 15 set bits.
{
    x -= (x>>1) & 0x5555555555555555UL;  // 0-2 in 2 bits
    x = ((x>>2) & 0x3333333333333333UL) + (x & 0x3333333333333333UL);  // 0-4 in 4 bits
    x *= 0x1111111111111111UL;
    return x>>60;
}

A routine for words with no more than 3 set bits is

static inline ulong bit_count_3(ulong x)
{
    x -= (x>>1) & 0x5555555555555555UL;  // 0-2 in 2 bits
    x *= 0x5555555555555555UL;
    return x>>62;
}


1.8.2 Counting blocks

Compute the number of bit-blocks in a binary word with the following function:

static inline ulong bit_block_count(ulong x)
// Return number of bit blocks.
// E.g.:
// ..1..11111...111. -> 3
// ...1..11111...111 -> 3
// ......1.....1.1.. -> 3
// .........111.1111 -> 2
{
    return (x & 1) + bit_count( (x^(x>>1)) ) / 2;
}

Similarly,  the  number  of  blocks  with  two  or  more  bits  can  be  counted  via: 

static inline ulong bit_block_ge2_count(ulong x)
// Return number of bit blocks with at least 2 bits.
// E.g.:
// ..1..11111...111. -> 2
// ...1..11111...111 -> 2
// ......1.....1.1.. -> 0
// .........111.1111 -> 2
{
    // x & (x>>1) shortens every block by one bit, so isolated ones
    // vanish and each block of two or more bits remains a single block.
    return bit_block_count( x & (x>>1) );
}
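The example words from the comments can be used as a quick check. The sketch below is self-contained: it uses a simple loop for the bit count, writes the example words in hexadecimal, and counts the blocks of two or more bits by first shortening every block by one bit with x & (x>>1). The names block_count and block_ge2_count are chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>

static inline unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x)  { ++n;  x &= (x - 1); }
    return n;
}

// x ^ (x>>1) has a one at every transition; a block that does not touch
// bit 0 contributes two transitions, a block at bit 0 only one.
static inline unsigned block_count(uint64_t x)
{
    return (unsigned)(x & 1) + popcount64(x ^ (x >> 1)) / 2;
}

// Shorten each block by one bit, then count what is left.
static inline unsigned block_ge2_count(uint64_t x)
{
    return block_count(x & (x >> 1));
}
```

The words ..1..11111...111., ...1..11111...111, ......1.....1.1.., and .........111.1111 correspond to 0x4f8e, 0x27c7, 0x414, and 0xef.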

1.8.3 GCC built-in functions ‡

Newer versions of the C compiler of the GNU Compiler Collection (GCC [146], starting with version 3.4) include a function __builtin_popcountl(ulong) that counts the bits of an unsigned long integer. The following list is taken from [147]:

int __builtin_ffs (unsigned int x)
    Returns one plus the index of the least significant 1-bit of x,
    or if x is zero, returns zero.

int __builtin_clz (unsigned int x)
    Returns the number of leading 0-bits in x, starting at the
    most significant bit position. If x is 0, the result is undefined.

int __builtin_ctz (unsigned int x)
    Returns the number of trailing 0-bits in x, starting at the
    least significant bit position. If x is 0, the result is undefined.

int __builtin_popcount (unsigned int x)
    Returns the number of 1-bits in x.

int __builtin_parity (unsigned int x)
    Returns the parity of x, i.e. the number of 1-bits in x modulo 2.

The names of the corresponding versions for arguments of type unsigned long are obtained by adding 'l' (ell) to the names, for the type unsigned long long append 'll'. Two more useful built-ins are:

void __builtin_prefetch (const void *addr, ...)
    Prefetch memory location addr

long __builtin_expect (long exp, long c)
    Function to provide the compiler with branch prediction information.
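Some routines of this chapter can be expressed with these built-ins. The following is a sketch that requires GCC or a compatible compiler (for example Clang); the wrapper names are chosen for this illustration. Because __builtin_ctzl() and __builtin_clzl() are undefined for a zero argument, the wrappers test for zero and return 0 in that case, matching the convention of lowest_one_idx() and highest_one_idx().

```cpp
#include <cassert>
#include <climits>

typedef unsigned long ulong;

static inline ulong bit_count_builtin(ulong x)
{
    return (ulong)__builtin_popcountl(x);
}

static inline ulong lowest_one_idx_builtin(ulong x)
{
    if (x == 0)  return 0;               // ctz is undefined for zero
    return (ulong)__builtin_ctzl(x);     // trailing zeros == index of lowest one
}

static inline ulong highest_one_idx_builtin(ulong x)
{
    if (x == 0)  return 0;               // clz is undefined for zero
    const ulong bits = sizeof(ulong) * CHAR_BIT;
    return bits - 1 - (ulong)__builtin_clzl(x);
}
```

Whether the built-ins compile to single machine instructions depends on the target CPU and the compiler options.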

1.8.4 Counting the bits of many words ‡

x[ 0]=11111111  a0=11111111  a1=          a2=          a3=          a4=
x[ 1]=11111111  a0=          a1=11111111  a2=          a3=          a4=
x[ 2]=11111111  a0=11111111  a1=11111111  a2=          a3=          a4=
x[ 3]=11111111  a0=          a1=          a2=11111111  a3=          a4=
x[ 4]=11111111  a0=11111111  a1=          a2=11111111  a3=          a4=
x[ 5]=11111111  a0=          a1=11111111  a2=11111111  a3=          a4=
x[ 6]=11111111  a0=11111111  a1=11111111  a2=11111111  a3=          a4=
x[ 7]=11111111  a0=          a1=          a2=          a3=11111111  a4=
x[ 8]=11111111  a0=11111111  a1=          a2=          a3=11111111  a4=
x[ 9]=11111111  a0=          a1=11111111  a2=          a3=11111111  a4=
x[10]=11111111  a0=11111111  a1=11111111  a2=          a3=11111111  a4=
x[11]=11111111  a0=          a1=          a2=11111111  a3=11111111  a4=
x[12]=11111111  a0=11111111  a1=          a2=11111111  a3=11111111  a4=
x[13]=11111111  a0=          a1=11111111  a2=11111111  a3=11111111  a4=
x[14]=11111111  a0=11111111  a1=11111111  a2=11111111  a3=11111111  a4=
x[15]=11111111  a0=          a1=          a2=          a3=          a4=11111111
x[16]=11111111  a0=11111111  a1=          a2=          a3=          a4=11111111

Figure 1.8-A: Counting the bits of an array (where all bits are set) via vertical addition.
For counting the bits in a long array the technique of vertical addition can be useful. For ordinary addition the following relation holds:

    a + b == (a^b) + ((a&b)<<1)

The carry term (a&b) is propagated to the left. We now replace this 'horizontal' propagation by a 'vertical' one, that is, propagation into another word. An implementation of this idea is [FXT: bits/bitcount-v-demo.cc]:

ulong
bit_count_leq31(const ulong *x, ulong n)
// Return sum(j=0, n-1, bit_count(x[j]) )
// Must have n<=31
{
    ulong a0=0, a1=0, a2=0, a3=0, a4=0;
    //     1,    3,    7,   15,   31,  <--= max n
    for (ulong k=0; k<n; ++k)
    {
        ulong cy = x[k];
        { ulong t = a0 & cy;  a0 ^= cy;  cy = t; }
        { ulong t = a1 & cy;  a1 ^= cy;  cy = t; }
        { ulong t = a2 & cy;  a2 ^= cy;  cy = t; }
        { ulong t = a3 & cy;  a3 ^= cy;  cy = t; }
        { a4 ^= cy; }
        // [ PRINT x[k], a0, a1, a2, a3, a4 ]
    }

    ulong b = bit_count(a0);
    b += (bit_count(a1)<<1);
    b += (bit_count(a2)<<2);
    b += (bit_count(a3)<<3);
    b += (bit_count(a4)<<4);
    return b;
}

Figure 1.8-A shows the intermediate values with the computation of a length-17 array of all-ones words. After the loop the values of the variables a0, ..., a4 are

    a4=11111111
    a3=
    a2=
    a1=
    a0=11111111

The columns, read as binary numbers, tell us that in all positions of all words there were a total of 17 = 10001_2 bits. The remaining instructions compute the total bit-count.
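The computation of the figure can be repeated with a compact variant of the routine. The following sketch keeps the five accumulators in an array and loops over them instead of writing out each step; it fixes the word size at 64 bits and uses a simple loop for the final bit counts. The names vertical_bit_count and vertical_check are chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>

static inline unsigned long bit_count64(uint64_t x)
{
    unsigned long n = 0;
    while (x)  { ++n;  x &= (x - 1); }
    return n;
}

// Vertical addition with five accumulators: valid for n <= 31 words.
static unsigned long vertical_bit_count(const uint64_t *x, unsigned n)
{
    uint64_t a[5] = {0, 0, 0, 0, 0};
    for (unsigned k = 0; k < n; ++k)
    {
        uint64_t cy = x[k];
        for (unsigned j = 0; j < 5; ++j)
        {
            uint64_t t = a[j] & cy;  // carry out of accumulator j
            a[j] ^= cy;              // sum bit stays in accumulator j
            cy = t;                  // carry moves on to accumulator j+1
        }
    }
    unsigned long b = 0;
    for (unsigned j = 0; j < 5; ++j)  b += bit_count64(a[j]) << j;
    return b;
}

// Self-check: 17 all-ones words, and a mixed array against the plain sum.
static bool vertical_check()
{
    uint64_t ones[17];
    for (unsigned k = 0; k < 17; ++k)  ones[k] = ~uint64_t(0);
    if (vertical_bit_count(ones, 17) != 17UL * 64UL)  return false;

    uint64_t mixed[31];
    unsigned long ref = 0;
    for (unsigned k = 0; k < 31; ++k)
    {
        mixed[k] = 0x9E3779B97F4A7C15ULL * (k + 1);  // arbitrary test pattern
        ref += bit_count64(mixed[k]);
    }
    return vertical_bit_count(mixed, 31) == ref;
}
```

Accumulator j holds bit j of the per-column counts, so its bit count enters the total with weight 2**j.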

After some simplifications and loop-unrolling a routine for counting the bits of 15 words can be given as [FXT: bits/bitcount-v.cc]:

static inline ulong bit_count_v15(const ulong *x)
// Return sum(j=0, 14, bit_count(x[j]) )
// Technique is "vertical" addition.
{
#define VV(A)  { ulong t = A & cy;  A ^= cy;  cy = t; }
    ulong a1, a2, a3;
    ulong a0=x[0];
    { ulong cy = x[ 1];  VV(a0);  a1 = cy; }
    { ulong cy = x[ 2];  VV(a0);  a1 ^= cy; }
    { ulong cy = x[ 3];  VV(a0);  VV(a1);  a2 = cy; }
    { ulong cy = x[ 4];  VV(a0);  VV(a1);  a2 ^= cy; }
    { ulong cy = x[ 5];  VV(a0);  VV(a1);  a2 ^= cy; }
    { ulong cy = x[ 6];  VV(a0);  VV(a1);  a2 ^= cy; }
    { ulong cy = x[ 7];  VV(a0);  VV(a1);  VV(a2);  a3 = cy; }
    { ulong cy = x[ 8];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
    { ulong cy = x[ 9];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
    { ulong cy = x[10];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
    { ulong cy = x[11];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
    { ulong cy = x[12];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
    { ulong cy = x[13];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
    { ulong cy = x[14];  VV(a0);  VV(a1);  VV(a2);  a3 ^= cy; }
#undef VV

    ulong b = bit_count(a0);
    b += (bit_count(a1)<<1);
    b += (bit_count(a2)<<2);
    b += (bit_count(a3)<<3);
    return b;
}

Each use of the macro VV compiles to three machine instructions: one AND, one XOR, and one MOVE.
The routine for the user is

    ulong
    bit_count_v(const ulong *x, ulong n)
    // Return sum(j=0, n-1, bit_count(x[j]) )
    {
        ulong b = 0;
        const ulong *xe = x + n + 1;
        while ( x+15 < xe )  // process blocks of 15 elements
        {
            b += bit_count_v15(x);
            x += 15;
        }

        // process remaining elements:
        const ulong r = (ulong)(xe-x-1);
        for (ulong k=0; k<r; ++k)  b += bit_count(x[k]);

        return b;
    }

Compared to the obvious method of bit-counting

    ulong bit_count_v2(const ulong *x, ulong n)
    {
        ulong b = 0;
        for (ulong k=0; k<n; ++k)  b += bit_count(x[k]);
        return b;
    }

our routine uses roughly 30 percent less time when an array of 100,000,000 words is processed. There
are many possible modifications of the method. If the bit-count routine is rather slow, one may want to
avoid the four calls to it after the processing of every 15 words. Instead, the variables a0, ..., a3 could
be added (vertically!) to an array of more elements. If that array has n elements, then the n calls to the
bit-count routine are necessary only once for each block of 2^n - 1 words.
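The vertical addition can also be written as a loop over bit planes. The following is a minimal sketch, not the FXT code: the names popcount64 and bit_count_vertical are made up for illustration, 64-bit words are assumed, and each block is limited to 15 words so that four planes suffice (the carry out of the highest plane is then always zero).

```cpp
#include <cstddef>
#include <cstdint>

// Portable bit count of one word (clears the lowest set bit per step).
static inline unsigned popcount64(std::uint64_t x)
{
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

// Hypothetical helper: count the set bits of n words by vertical addition.
// plane[p] holds bit p of 64 independent column counters.
static inline std::uint64_t bit_count_vertical(const std::uint64_t *x, std::size_t n)
{
    const unsigned P = 4;                      // 4 planes: column sums up to 15
    const std::size_t block = (1u << P) - 1;   // 15 words per block
    std::uint64_t total = 0;
    std::size_t k = 0;
    while (k < n)
    {
        std::uint64_t plane[P] = {0, 0, 0, 0};
        const std::size_t end = (n - k < block) ? n : k + block;
        for ( ; k < end; ++k)
        {
            std::uint64_t cy = x[k];           // word to add, acts as carry-in
            for (unsigned p = 0; p < P; ++p)
            {
                const std::uint64_t t = plane[p] & cy;  // carry out of this plane
                plane[p] ^= cy;                         // sum bit of this plane
                cy = t;
            }
        }
        for (unsigned p = 0; p < P; ++p)
            total += (std::uint64_t)popcount64(plane[p]) << p;
    }
    return total;
}
```

The inner loop is the macro VV applied to every plane in turn; unrolling it and dropping the steps whose carry is known to be zero gives the shape of the listing above.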


1.9  Words  as  bitsets 


1.9.1  Testing  whether  subset  of  given  bitset 

The following function tests whether a word u, as a bitset, is a subset of the bitset given as the word e
[FXT: bits/bitsubsetq.h]:

    static inline bool is_subset(ulong u, ulong e)
    // Return whether the set bits of u are a subset of the set bits of e.
    // That is, as bitsets, test whether u is a subset of e.
    {
        return ( (u & e)==u );
    //    return  ( (u & ~e)==0 );
    //    return  ( (~u | e)==~0UL );
    }

If u contains any bits not set in e, then these bits are cleared in the AND-operation and the test for
equality will fail. The second version tests whether no element of u lies outside of e, the third is obtained
by complementing both sides of that equality. A proper subset of e is a subset that is not equal to e:

    static inline bool is_proper_subset(ulong u, ulong e)
    // Return whether u (as bitset) is a proper subset of e.
    {
        return ( (u<e) && ((u & e)==u) );
    }

The generated machine code contains a branch:

    xorl    %eax, %eax    # prephitmp.71
    cmpq    %rsi, %rdi    # e, u
    jae     .L6           #,  /* branch to end of function */
    andq    %rdi, %rsi    # u, e
    xorl    %eax, %eax    # prephitmp.71
    cmpq    %rdi, %rsi    # u, e
    sete    %al           #, prephitmp.71

Replace the Boolean operator by the bit-wise operator to obtain branch-free machine code:

    cmpq    %rsi, %rdi    # e, u
    setb    %al           #, tmp63
    andq    %rdi, %rsi    # u, e
    cmpq    %rdi, %rsi    # u, e
    sete    %dl           #, tmp66
    andl    %edx, %eax    # tmp66, tmp63
    movzbl  %al, %eax     # tmp63, tmp61
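The subset tests are easy to check exhaustively on small words. The sketch below is an illustration under assumptions: the type name ulong_t and the function names are made up, the third variant is written as a comparison against the all-ones word, and the proper-subset test uses the bit-wise AND in place of the Boolean operator.

```cpp
#include <cstdint>

typedef std::uint64_t ulong_t;  // stand-in for the word type of the text

static inline bool is_subset_v1(ulong_t u, ulong_t e) { return (u & e) == u; }
static inline bool is_subset_v2(ulong_t u, ulong_t e) { return (u & ~e) == 0; }
static inline bool is_subset_v3(ulong_t u, ulong_t e) { return (~u | e) == ~(ulong_t)0; }

// Proper subset, with '&' instead of '&&' so that no short-circuit
// evaluation (and hence no branch) is required.
static inline bool is_proper_subset_bw(ulong_t u, ulong_t e)
{
    return (u < e) & ((u & e) == u);
}
```

Both operands of the bit-wise AND are either 0 or 1, so the result is the same as with the Boolean operator; only the evaluation order constraint is gone.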

1.9.2  Testing  whether  an  element  is  in  a given  set 


We determine whether a given number is an element of a given set (which must be a subset of the set
{0, 1, 2, ..., BITS_PER_LONG-1}). For example, to determine whether x is a prime less than 32, use the
function

    ulong m = (1UL<<2) | (1UL<<3) | (1UL<<5) | ... | (1UL<<31);  // precomputed
    static inline ulong is_tiny_prime(ulong x)
    {
        return m & (1UL << x);
    }
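For illustration, the mask can be spelled out in full. This is a sketch under assumptions: the names tiny_prime_mask and is_tiny_prime32 are hypothetical, a 32-bit unsigned type is used, and (unlike the function above) arguments of 32 or more are rejected rather than left to the caller.

```cpp
#include <cstdint>

// Bit p is set for each prime p < 32.
constexpr std::uint32_t tiny_prime_mask()
{
    return (1u<<2) | (1u<<3) | (1u<<5) | (1u<<7) | (1u<<11) | (1u<<13)
         | (1u<<17) | (1u<<19) | (1u<<23) | (1u<<29) | (1u<<31);
}

// Membership test: one shift and one AND once the range check has passed.
static inline bool is_tiny_prime32(unsigned x)
{
    return x < 32 && ((tiny_prime_mask() >> x) & 1u);
}
```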

The same idea can be applied to look up tiny factors [FXT: bits/tinyfactors.h]:

    static inline bool is_tiny_factor(ulong x, ulong d)
    // For x,d < BITS_PER_LONG (!)
    // return whether d divides x (1 and x included as divisors)
    // no need to check whether d==0
    //
    {
        return ( 0 != ( (tiny_factors_tab[x]>>d) & 1 ) );
    }

The function uses the precomputed array [FXT: bits/tinyfactors.cc]:

    extern const ulong tiny_factors_tab[] =
    {
                     0x0UL,  // x = 0:                     ( bits: ........)
                     0x2UL,  // x = 1:  1                  ( bits: ......1.)
                     0x6UL,  // x = 2:  1 2                ( bits: .....11.)
                     0xaUL,  // x = 3:  1 3                ( bits: ....1.1.)
                    0x16UL,  // x = 4:  1 2 4              ( bits: ...1.11.)
                    0x22UL,  // x = 5:  1 5                ( bits: ..1...1.)
                    0x4eUL,  // x = 6:  1 2 3 6            ( bits: .1..111.)
                    0x82UL,  // x = 7:  1 7                ( bits: 1.....1.)
                   0x116UL,  // x = 8:  1 2 4 8
                   0x20aUL,  // x = 9:  1 3 9
      [--snip--]
              0x20000002UL,  // x = 29:  1 29
              0x4000846eUL,  // x = 30:  1 2 3 5 6 10 15 30
              0x80000002UL,  // x = 31:  1 31
    #if  ( BITS_PER_LONG > 32 )
             0x100010116UL,  // x = 32:  1 2 4 8 16 32
             0x20000080aUL,  // x = 33:  1 3 11 33
      [--snip--]
      0x2000000000000002UL,  // x = 61:  1 61
      0x4000000080000006UL,  // x = 62:  1 2 31 62
      0x800000000020028aUL   // x = 63:  1 3 7 9 21 63
    #endif  // ( BITS_PER_LONG > 32 )
    };
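The table entries follow directly from the definition: bit d of entry x is set exactly if d divides x. The following sketch builds such a table at run time; it is not how FXT produces the array, and the names make_tiny_factors and is_tiny_factor_sketch are assumptions made for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: entry x has bit d set exactly if d divides x
// (for 1 <= d <= x); entry 0 is zero. Requires n <= 64.
static std::vector<std::uint64_t> make_tiny_factors(unsigned n)
{
    std::vector<std::uint64_t> tab(n, 0);
    for (unsigned x = 1; x < n; ++x)
        for (unsigned d = 1; d <= x; ++d)
            if (x % d == 0)  tab[x] |= (std::uint64_t)1 << d;
    return tab;
}

// Lookup as in the text: shift and mask, no division. Requires d < 64.
static inline bool is_tiny_factor_sketch(const std::vector<std::uint64_t> &tab,
                                         unsigned x, unsigned d)
{
    return ((tab[x] >> d) & 1) != 0;
}
```

Bit 0 is never set, which is why the lookup needs no special case for d==0.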

Bit-arrays of arbitrary size are discussed in section 4.6 on page 164.


1.10 Index of the i-th set bit


To determine the index of the i-th set bit, we use a technique similar to the method for counting the bits
of a word. Only the 64-bit version is shown [FXT: bits/ith-one-idx.h]:

    static inline ulong ith_one_idx(ulong x, ulong i)
    // Return index of the i-th set bit of x where 0 <= i < bit_count(x).
    {
        ulong x2 = x - ((x>>1) & 0x5555555555555555UL);    // 0-2 in 2 bits
        ulong x4 = ((x2>>2) & 0x3333333333333333UL) +
                    (x2 & 0x3333333333333333UL);            // 0-4 in 4 bits
        ulong x8 = ((x4>>4) + x4) & 0x0f0f0f0f0f0f0f0fUL;  // 0-8 in 8 bits
        ulong ct = (x8 * 0x0101010101010101UL) >> 56;       // bit count

        ++i;
        if ( ct < i )  return ~0UL;  // less than i bits set

        ulong x16 = (0x00ff00ff00ff00ffUL & x8) + (0x00ff00ff00ff00ffUL & (x8>>8));      // 0-16
        ulong x32 = (0x0000ffff0000ffffUL & x16) + (0x0000ffff0000ffffUL & (x16>>16));  // 0-32

        ulong w, s = 0;
        w = x32 & 0xffffffffUL;
        if ( w < i )  { s += 32;  i -= w; }

        x16 >>= s;
        w = x16 & 0xffff;
        if ( w < i )  { s += 16;  i -= w; }

        x8 >>= s;
        w = x8 & 0xff;
        if ( w < i )  { s += 8;  i -= w; }

        x4 >>= s;
        w = x4 & 0xf;
        if ( w < i )  { s += 4;  i -= w; }

        x2 >>= s;
        w = x2 & 3;
        if ( w < i )  { s += 2;  i -= w; }

        x >>= s;
        s += ( (x&1) != i );

        return s;
    }
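A straightforward reference loop is handy for checking the fast routine. The sketch below is such a reference, not part of FXT; the name ith_one_idx_ref is made up, and the all-ones return value for an out-of-range i mirrors the convention of the routine above.

```cpp
#include <cstdint>

// Reference sketch: index of the i-th set bit of x (i = 0 is the lowest
// set bit), or the all-ones word if x has fewer than i+1 set bits.
static inline std::uint64_t ith_one_idx_ref(std::uint64_t x, std::uint64_t i)
{
    for (std::uint64_t idx = 0; idx < 64; ++idx)
    {
        if ((x >> idx) & 1)
        {
            if (i == 0)  return idx;
            --i;
        }
    }
    return ~(std::uint64_t)0;
}
```

The loop needs up to 64 steps, whereas the routine in the text needs a fixed number of steps: it halves the search interval using the partial bit counts x32, x16, ..., x2.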


1.11  Avoiding  branches 


Branches are expensive operations with many CPUs, especially if the CPU pipeline is very long. A useful
trick is to replace

    if ( (x<0) || (x>m) )  { ... }

where x might be a signed integer, by

    if ( (unsigned)x > m )  { ... }

The obvious code to test whether a point (x, y) lies outside a square box of size m is

    if ( (x<0) || (x>m) || (y<0) || (y>m) )  { ... }

If m+1 is a power of 2 (that is, m = 2^k - 1), it is better to use

    if ( ( (ulong)x | (ulong)y ) > (unsigned)m )  { ... }

A negative coordinate turns into a huge unsigned value, and the OR exceeds m exactly if one of the
coordinates has a bit set above the k lowest bits.
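The replacement tests can be compared against the obvious ones over a small range. The sketch below makes its assumptions explicit: the function names are hypothetical, m is non-negative, and for the two-dimensional test m is taken to be of the form 2^k - 1 so that the OR of two in-range coordinates cannot exceed m.

```cpp
// Obvious test with four comparisons.
static inline bool outside_obvious(long x, long y, long m)
{
    return (x < 0) || (x > m) || (y < 0) || (y > m);
}

// Single unsigned comparison; assumes m = 2^k - 1 and m >= 0.
static inline bool outside_branchfree(long x, long y, long m)
{
    return ((unsigned long)x | (unsigned long)y) > (unsigned long)m;
}

// One-dimensional version; valid for any m >= 0.
static inline bool out_of_range(long x, long m)
{
    return (unsigned long)x > (unsigned long)m;
}
```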

The following functions are given in [FXT: bits/branchless.h]. This function returns max(0, x). That is,
zero is returned for negative input, else the unmodified input:

    static inline long max0(long x)
    {
        return x & ~(x >> (BITS_PER_LONG-1));
    }

There is no restriction on the input range. The trick used is that with negative x the arithmetic shift will
give a word of all ones which is then complemented, so the AND-operation clears all bits. Note this function
will only work if the compiler emits an arithmetic right shift, see section 1.1.3. The following
routine computes min(0, x):

    static inline long min0(long x)
    // Return min(0, x), i.e. return zero for positive input
    {
        return x & (x >> (BITS_PER_LONG-1));
    }
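A quick check of both functions, as a sketch: the names max0_sketch and min0_sketch and the constant LONG_BITS are made up here, and the code assumes (as the text does) that the right shift of a negative long is arithmetic, which holds for GCC and Clang on common platforms.

```cpp
#include <climits>

static const int LONG_BITS = (int)(sizeof(long) * CHAR_BIT);

// x >> (LONG_BITS-1) is all ones for negative x and zero otherwise
// (assuming an arithmetic right shift).
static inline long max0_sketch(long x) { return x & ~(x >> (LONG_BITS - 1)); }
static inline long min0_sketch(long x) { return x &  (x >> (LONG_BITS - 1)); }
```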

The following upos_*() functions only work for a limited range. The highest bit must not be set as it is
used to emulate the carry flag. Branchless computation of the absolute difference |a - b|:

    static inline ulong upos_abs_diff(ulong a, ulong b)
    {
        long d1 = b - a;
        long d2 = (d1 & (d1>>(BITS_PER_LONG-1)))<<1;
        return d1 - d2;  // == (b - d) - (a + d);
    }

The following routine sorts two values:

    static inline void upos_sort2(ulong &a, ulong &b)
    // Set {a, b} := {min(a, b), max(a,b)}
    // Both a and b must not have the most significant bit set
    {
        long d = b - a;
        d &= (d>>(BITS_PER_LONG-1));
        a += d;
        b -= d;
    }
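The same two computations can be done entirely in unsigned arithmetic, which avoids relying on the behavior of signed overflow. This is a sketch under assumptions, not the FXT code: the names ulong_t, sign_mask, abs_diff_sketch and sort2_sketch are hypothetical, and the conversion to long followed by an arithmetic right shift is assumed to behave as on common two's complement platforms.

```cpp
#include <climits>

typedef unsigned long ulong_t;
static const int UL_BITS = (int)(sizeof(ulong_t) * CHAR_BIT);

// All-ones mask if d, read as a signed value, is negative; else zero.
static inline ulong_t sign_mask(ulong_t d)
{
    return (ulong_t)((long)d >> (UL_BITS - 1));
}

// |a - b| for a, b without the most significant bit set.
static inline ulong_t abs_diff_sketch(ulong_t a, ulong_t b)
{
    ulong_t d = b - a;                       // wraps around if b < a
    ulong_t d2 = (d & sign_mask(d)) << 1;    // 2*d if "negative", else 0
    return d - d2;                           // d or -d (modulo word size)
}

// Set {a, b} := {min(a,b), max(a,b)} without a branch.
static inline void sort2_sketch(ulong_t &a, ulong_t &b)
{
    ulong_t d = b - a;
    d &= sign_mask(d);                       // b-a if b<a, else 0
    a += d;
    b -= d;
}
```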

Johan Ronnblom gives [priv. comm.] the following versions for signed integer minimum, maximum, and
absolute value, that can be advantageous for CPUs where immediates are expensive:

    #define B1 (BITS_PER_LONG-1)  // bits of signed int minus one
    #define MINI(x,y) (((x) & (((int)((x)-(y)))>>B1)) + ((y) & ~(((int)((x)-(y)))>>B1)))
    #define MAXI(x,y) (((x) & ~(((int)((x)-(y)))>>B1)) + ((y) & (((int)((x)-(y)))>>B1)))
    #define ABSI(x) (((x) & ~(((int)(x))>>B1)) - ((x) & (((int)(x))>>B1)))

Your compiler may be smarter than you thought

The machine code generated for

    x = x & ~(x >> (BITS_PER_LONG-1));  // max0()

is

    35:  48 99           cqto
    37:  48 83 c4 08     add    $0x8,%rsp   // stack adjustment
    3b:  48 f7 d2        not    %rdx
    3e:  48 21 d0        and    %rdx,%rax

The  variable  x resides  in  the  register  rAX  both  at  start  and  end  of  the  function.  The  compiler  uses  a 
special  (AMD64)  instruction  cqto.  Quoting  [13]: 

Copies  the  sign  bit  in  the  rAX  register  to  all  bits  of  the  rDX  register.  The  effect  of  this 
instruction  is  to  convert  a signed  word,  doubleword,  or  quadword  in  the  rAX  register  into 
a signed  doubleword,  quadword,  or  double-quadword  in  the  rDX:rAX  registers.  This  action 
helps  avoid  overflow  problems  in  signed  number  arithmetic. 

Now the equivalent

    x = ( x<0 ? 0 : x );  // max0() "simple minded"

is compiled to:

    35:  ba 00 00 00 00  mov    $0x0,%edx
    3a:  48 85 c0        test   %rax,%rax
    3d:  48 0f 48 c2     cmovs  %rdx,%rax   // note %edx is %rdx

A conditional move (cmovs) instruction is used here. That is, the optimized version is (on my machine)
actually worse than the straightforward equivalent.

A second example is a function to adjust a given value when it lies outside a given range [FXT:
bits/branchless.h]:

    static inline long clip_range(long x, long mi, long ma)
    // Code equivalent to (for mi<=ma):
    //   if ( x<mi )  x = mi;
    //   else if ( x>ma )  x = ma;
    {
        x -= mi;
        x = clip_range0(x, ma-mi);
        x += mi;
        return x;
    }


The auxiliary function used involves one branch:

    static inline long clip_range0(long x, long m)
    // Code equivalent (for m>0) to:
    //   if ( x<0 )  x = 0;
    //   else if ( x>m )  x = m;
    //   return x;
    {
        if ( (ulong)x > (ulong)m )  x = m & ~(x >> (BITS_PER_LONG-1));
        return x;
    }

The generated machine code is

     0:  48 89 f8        mov     %rdi,%rax
     3:  48 29 f2        sub     %rsi,%rdx
     6:  31 c9           xor     %ecx,%ecx
     8:  48 29 f0        sub     %rsi,%rax
     b:  78 0a           js      17 <_Z2CLlll+0x17>  // the branch
     d:  48 39 d0        cmp     %rdx,%rax
    10:  48 89 d1        mov     %rdx,%rcx
    13:  48 0f 4e c8     cmovle  %rax,%rcx
    17:  48 8d 04 0e     lea     (%rsi,%rcx,1),%rax

Now we replace the code by

    static inline long clip_range(long x, long mi, long ma)
    {
        x -= mi;
        if ( x<0 )  x = 0;
    //    else  // commented out to make (compiled) function really branchless
        {
            ma -= mi;
            if ( x>ma )  x = ma;
        }
        x += mi;
        return x;
    }

Then the compiler generates branchless code:

     0:  48 89 f8        mov    %rdi,%rax
     3:  b9 00 00 00 00  mov    $0x0,%ecx
     8:  48 29 f0        sub    %rsi,%rax
     b:  48 0f 48 c1     cmovs  %rcx,%rax
     f:  48 29 f2        sub    %rsi,%rdx
    12:  48 39 d0        cmp    %rdx,%rax
    15:  48 0f 4f c2     cmovg  %rdx,%rax
    19:  48 01 f0        add    %rsi,%rax

Still, with CPUs that do not have a conditional move instruction (or some branchless equivalent of it)
the techniques shown in this section can be useful.
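Both ways of clipping can be checked against a plain reference. The following is a sketch with hypothetical names (clip0_sketch, clip_simple, clip_via0); it assumes mi <= ma, an arithmetic right shift for negative values, and that the differences x-mi and ma-mi do not overflow.

```cpp
#include <climits>

static const int LONG_BITS = (int)(sizeof(long) * CHAR_BIT);

// Clip x to [0, m] for m >= 0, using one unsigned comparison.
static inline long clip0_sketch(long x, long m)
{
    if ((unsigned long)x > (unsigned long)m)
        x = m & ~(x >> (LONG_BITS - 1));   // 0 if x<0, m if x>m
    return x;
}

// Clip x to [mi, ma] with two independent conditionals, which a
// compiler may turn into conditional moves.
static inline long clip_simple(long x, long mi, long ma)
{
    x -= mi;
    if (x < 0)  x = 0;
    ma -= mi;
    if (x > ma)  x = ma;
    x += mi;
    return x;
}

// Clip x to [mi, ma] via the auxiliary function.
static inline long clip_via0(long x, long mi, long ma)
{
    return clip0_sketch(x - mi, ma - mi) + mi;
}
```

Which variant is faster depends on the compiler and the CPU, so measuring (or inspecting the generated code, as done above) is the only reliable guide.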


1.12  Bit-wise  rotation  of  a word 


Neither C nor C++ has a statement for bit-wise rotation of a binary word (which may be considered a
missing feature). The operation can be emulated via [FXT: bits/bitrotate.h]:

    static inline ulong bit_rotate_left(ulong x, ulong r)
    // Return word rotated r bits to the left
    // (i.e. toward the most significant bit)
    {
        return (x<<r) | (x>>(BITS_PER_LONG-r));
    }

As already mentioned, GCC emits exactly the CPU instruction that is meant here, even with non-constant
argument r. Explicit use of the corresponding assembler instruction should not do any harm:

    static inline ulong bit_rotate_right(ulong x, ulong r)
    // Return word rotated r bits to the right
    // (i.e. toward the least significant bit)
    {
    #if defined BITS_USE_ASM  // use x86 asm code
        return asm_ror(x, r);
    #else
        return (x>>r) | (x<<(BITS_PER_LONG-r));
    #endif
    }

Here we use an assembler instruction when available [FXT: bits/bitasm-amd64.h]:

    static inline ulong asm_ror(ulong x, ulong r)
    {
        asm ("rorq %%cl, %0" : "=r" (x) : "0" (x), "c" (r));
        return x;
    }

Rotation using only a part of the word length can be implemented as

    static inline ulong bit_rotate_left(ulong x, ulong r, ulong ldn)
    // Return ldn-bit word rotated r bits to the left
    // (i.e. toward the most significant bit)
    // Must have 0 <= r <= ldn
    {
        ulong m = ~0UL >> ( BITS_PER_LONG - ldn );
        x &= m;
        x = (x<<r) | (x>>(ldn-r));
        x &= m;
        return x;
    }

and

    static inline ulong bit_rotate_right(ulong x, ulong r, ulong ldn)
    // Return ldn-bit word rotated r bits to the right
    // (i.e. toward the least significant bit)
    // Must have 0 <= r <= ldn
    {
        ulong m = ~0UL >> ( BITS_PER_LONG - ldn );
        x &= m;
        x = (x>>r) | (x<<(ldn-r));
        x &= m;
        return x;
    }

Finally, the functions

    static inline ulong bit_rotate_sgn(ulong x, long r, ulong ldn)
    // Positive r --> shift away from element zero
    {
        if ( r > 0 )  return bit_rotate_left(x, (ulong)r, ldn);
        else          return bit_rotate_right(x, (ulong)-r, ldn);
    }

and (full-word version)

    static inline ulong bit_rotate_sgn(ulong x, long r)
    // Positive r --> shift away from element zero
    {
        if ( r > 0 )  return bit_rotate_left(x, (ulong)r);
        else          return bit_rotate_right(x, (ulong)-r);
    }


are  sometimes  convenient. 
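In C and C++ a shift by the full word length is undefined, so a rotation by zero needs care when it is written with two shifts. The following sketch sidesteps this by reducing r modulo ldn and returning early; the names rotl_ldn and rotr_ldn are hypothetical and 64-bit words are assumed.

```cpp
#include <cstdint>

// Rotate the lowest ldn bits of x left by r; bits above ldn are cleared.
// Assumes 1 <= ldn <= 64; r is reduced modulo ldn.
static inline std::uint64_t rotl_ldn(std::uint64_t x, unsigned r, unsigned ldn)
{
    const std::uint64_t m = ~(std::uint64_t)0 >> (64 - ldn);
    x &= m;
    r %= ldn;
    if (r == 0)  return x;                 // avoids a shift by ldn
    return ((x << r) | (x >> (ldn - r))) & m;
}

// Rotation to the right, expressed as a rotation to the left.
static inline std::uint64_t rotr_ldn(std::uint64_t x, unsigned r, unsigned ldn)
{
    r %= ldn;
    return rotl_ldn(x, (ldn - r) % ldn, ldn);
}
```

The early return costs a branch; whether that matters depends on how often r is zero and on how well the compiler recognizes the rotation idiom.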


1.13 Binary necklaces ‡

We  give  several  functions  related  to  cyclic  rotations  of  binary  words  and  a class  to  generate  binary 
necklaces. 


1.13.1  Cyclic  matching,  minimum,  and  maximum 



The following function determines whether there is a cyclic right shift of its second argument so that it
matches the first argument. It is given in [FXT: bits/bitcyclic-match.h]:

    static inline ulong bit_cyclic_match(ulong x, ulong y)
    // Return r if x==rotate_right(y, r) else return ~0UL.
    // In other words: return
    //   how often the right arg must be rotated right (to match the left)
    // or, equivalently:
    //   how often the left arg must be rotated left (to match the right)
    {
        ulong r = 0;
        do
        {
            if ( x==y )  return r;
            y = bit_rotate_right(y, 1);
        }
        while ( ++r < BITS_PER_LONG );
        return ~0UL;
    }

The functions shown work on the full length of the words; equivalents for the sub-word of the lowest ldn
bits are given in the respective files. Just one example:

    static inline ulong bit_cyclic_match(ulong x, ulong y, ulong ldn)
    // Return r if x==rotate_right(y, r, ldn) else return ~0UL
    // (using ldn-bit words)
    {
        ulong r = 0;
        do
        {
            if ( x==y )  return r;
            y = bit_rotate_right(y, 1, ldn);
        }
        while ( ++r < ldn );
        return ~0UL;
    }
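A few concrete values may help to fix the direction conventions. The sketch below restates the full-word matching loop in self-contained form; the names rotr1 and cyclic_match_sketch are made up and 64-bit words are assumed.

```cpp
#include <cstdint>

// Rotate a 64-bit word right by one position.
static inline std::uint64_t rotr1(std::uint64_t x) { return (x >> 1) | (x << 63); }

// Return r such that x equals y rotated right r times,
// or the all-ones word if no rotation of y matches x.
static inline std::uint64_t cyclic_match_sketch(std::uint64_t x, std::uint64_t y)
{
    for (std::uint64_t r = 0; r < 64; ++r)
    {
        if (x == y)  return r;
        y = rotr1(y);
    }
    return ~(std::uint64_t)0;
}
```

For example, the word 4 must be rotated right twice to give 1, while 1 must be rotated right 62 times to give 4; words with different bit counts never match.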


The minimum among all cyclic shifts of a word can be computed via the following function given in [FXT:
bits/bitcyclic-minmax.h]:

    static inline ulong bit_cyclic_min(ulong x)
    // Return minimum of all rotations of x
    {
        ulong r = 1;
        ulong m = x;
        do
        {
            x = bit_rotate_right(x, 1);
            if ( x<m )  m = x;
        }
        while ( ++r < BITS_PER_LONG );

        return m;
    }


1.13.2 Cyclic period and binary necklaces

Selecting from all n-bit words those that are equal to their cyclic minimum gives the sequence of the
binary length-n necklaces, see chapter 18 on page 370. For example, with 6-bit words we find:


      word    period        word    period
     ......     1          ..11.1     6
     .....1     6          ..1111     6
     ....11     6          .1.1.1     2
     ...1.1     6          .1.111     6
     ...111     6          .11.11     3
     ..1..1     3          .11111     6
     ..1.11     6          111111     1

The values in each right column can be computed using [FXT: bits/bitcyclic-period.h]:

    static inline ulong bit_cyclic_period(ulong x, ulong ldn)
    // Return minimal positive bit-rotation that transforms x into itself.
    // (using ldn-bit words)
    // The returned value is a divisor of ldn.
    {
        ulong y = bit_rotate_right(x, 1, ldn);
        return bit_cyclic_match(x, y, ldn) + 1;
    }

It is possible to completely avoid the rotation of partial words: let d be a divisor of the word length n.
Then the rightmost (n/d - 1) d = n - d bits of the word computed as x^(x>>d) are zero if and only if the
word has period d. So we can use the following function body:

    ulong sl = BITS_PER_LONG-ldn;
    for (ulong s=1; s<ldn; ++s)
    {
        ++sl;
        if ( 0==( (x^(x>>s)) << sl ) )  return s;
    }
    return ldn;

Testing for periods that are not divisors of the word length can be avoided as follows:

    ulong f = tiny_factors_tab[ldn];
    ulong sl = BITS_PER_LONG-ldn;
    for (ulong s=1; s<ldn; ++s)
    {
        ++sl;
        f >>= 1;
        if ( 0==(f&1) )  continue;
        if ( 0==( (x^(x>>s)) << sl ) )  return s;
    }
    return ldn;

The table of tiny factors used is shown in section 1.9.2 on page 24.
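The shift-based period test and the cyclic-minimum criterion for necklaces can be combined into a small self-check. The sketch below is an illustration with hypothetical names; it assumes 64-bit words, 1 <= ldn < 64, and words whose bits above ldn are zero. Divisibility is tested with the % operator instead of the table of tiny factors, and only divisors of ldn are tried, which also excludes shifts that match the word linearly but not cyclically.

```cpp
#include <cstdint>

// Cyclic period of the ldn-bit word x; only divisors s of ldn are tested.
static inline unsigned cyclic_period_sketch(std::uint64_t x, unsigned ldn)
{
    for (unsigned s = 1; s < ldn; ++s)
    {
        if (ldn % s != 0)  continue;
        // the left shift keeps only the lowest ldn-s bits of x ^ (x>>s)
        if (((x ^ (x >> s)) << (64 - ldn + s)) == 0)  return s;
    }
    return ldn;
}

// Minimum over all rotations of the ldn-bit word x.
static inline std::uint64_t cyclic_min_ldn(std::uint64_t x, unsigned ldn)
{
    const std::uint64_t mask = ~(std::uint64_t)0 >> (64 - ldn);
    std::uint64_t m = x;
    for (unsigned r = 1; r < ldn; ++r)
    {
        x = ((x >> 1) | (x << (ldn - 1))) & mask;   // rotate right by one
        if (x < m)  m = x;
    }
    return m;
}

// Count necklaces (words equal to their cyclic minimum) and those among
// them with full period, over all ldn-bit words.
static inline void count_necklaces(unsigned ldn, unsigned &neck, unsigned &full)
{
    neck = full = 0;
    for (std::uint64_t x = 0; x < ((std::uint64_t)1 << ldn); ++x)
    {
        if (cyclic_min_ldn(x, ldn) != x)  continue;
        ++neck;
        if (cyclic_period_sketch(x, ldn) == ldn)  ++full;
    }
}
```

For ldn = 6 this gives 14 necklaces, 9 of which have period 6, in agreement with the table of 6-bit words above.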

The version for ldn==BITS_PER_LONG can be optimized similarly:

    static inline ulong bit_cyclic_period(ulong x)
    // Return minimal positive bit-rotation that transforms x into itself.
    // (same as bit_cyclic_period(x, BITS_PER_LONG) )
    //
    // The returned value is a divisor of the word length,
    //   i.e. 1,2,4,8,...,BITS_PER_LONG.
    {
        ulong r = 1;
        do
        {
            ulong y = bit_rotate_right(x, r);
            if ( x==y )  return r;
            r <<= 1;
        }
        while ( r < BITS_PER_LONG );

        return r;  // == BITS_PER_LONG
    }


1.13.3  Generating  all  binary  necklaces 

We can generate all necklaces by the FKM algorithm given in section 18.1.1 on page 371. Here we
specialize the method for binary words. The words generated are the cyclic maxima [FXT: class
bit_necklace in bits/bit-necklace.h]:

    class bit_necklace
    {
    public:
        ulong a_;    // necklace
        ulong j_;    // period of the necklace
        ulong n2_;   // bit representing n: n2==2**(n-1)
        ulong j2_;   // bit representing j: j2==2**(j-1)
        ulong n_;    // number of bits in words
        ulong mm_;   // mask of n ones
        ulong tfb_;  // for fast factor lookup

    public:
        bit_necklace(ulong n)  { init(n); }
        ~bit_necklace()  { ; }

        void init(ulong n)
        {
            if ( 0==n )  n = 1;  // avoid hang
            if ( n>=BITS_PER_LONG )  n = BITS_PER_LONG;
            n_ = n;

            n2_ = 1UL<<(n-1);
            mm_ = (~0UL) >> (BITS_PER_LONG-n);
            tfb_ = tiny_factors_tab[n] >> 1;
            tfb_ |= n2_;  // needed for n==BITS_PER_LONG
            first();
        }

        void first()
        {
            a_ = 0;
            j_ = 1;
            j2_ = 1;
        }

        ulong data()  const  { return a_; }
        ulong period()  const  { return j_; }

The method for computing the successor is

        ulong next()
        // Create next necklace.
        // Return the period, zero when current necklace is last.
        {
            if ( a_==mm_ )  { first();  return 0; }

            do
            {
                // next lines compute index of highest zero, same result as
                // j_ = highest_zero_idx( a_ ^ (~mm_) );
                // but the direct computation is faster:
                j_ = n_ - 1;
                ulong jb = 1UL << j_;
                while ( 0!=(a_ & jb) )  { --j_;  jb>>=1; }

                j2_ = 1UL << j_;
                ++j_;
                a_ |= j2_;
                a_ = bit_copy_periodic(a_, j_, n_);
            }
            while ( 0==(tfb_ & j2_) );  // necklaces only

            return j_;
        }


It uses the following function for periodic copying [FXT: bits/bitperiodic.h]:

    static inline ulong bit_copy_periodic(ulong a, ulong p, ulong ldn)
    // Return word that consists of the lowest p bits of a repeated
    // in the lowest ldn bits (higher bits are zero).
    // E.g.: if p==3, ldn=7 and a=*****xyz (8-bit), then return 0zxyzxyz.
    // Must have p>0 and ldn>0.
    {
        a &= ( ~0UL >> (BITS_PER_LONG-p) );
        for (ulong s=p; s<ldn; s<<=1)  { a |= (a<<s); }
        a &= ( ~0UL >> (BITS_PER_LONG-ldn) );
        return a;
    }
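The periodic copy can be tried on a few small patterns. This is a self-contained sketch for 64-bit words; the name copy_periodic_sketch is made up, and the preconditions 1 <= p <= 64 and 1 <= ldn <= 64 are assumed.

```cpp
#include <cstdint>

// Repeat the lowest p bits of a throughout the lowest ldn bits.
static inline std::uint64_t copy_periodic_sketch(std::uint64_t a, unsigned p, unsigned ldn)
{
    a &= ~(std::uint64_t)0 >> (64 - p);                    // keep lowest p bits
    for (unsigned s = p; s < ldn; s <<= 1)  a |= (a << s); // filled length doubles
    a &= ~(std::uint64_t)0 >> (64 - ldn);                  // cut to ldn bits
    return a;
}
```

With p==3, ldn==7 and the pattern 101 the result is 1101101 (hexadecimal 6d). Because the filled length doubles with every pass, only about log2(ldn/p) passes of the loop are needed.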

Finally, we can easily detect whether a necklace is a Lyndon word:

        ulong is_lyndon_word()  const  { return (j2_ & n2_); }

        ulong next_lyn()
        // Create next Lyndon word.
        // Return the period (==n), zero when current necklace is last.
        {
            if ( a_==mm_ )  { first();  return 0; }
            do  { next(); }  while ( !is_lyndon_word() );
            return n_;
        }
    };

About 54 million necklaces per second are generated (with n = 32), corresponding to a rate of 112 M/s
for pre-necklaces [FXT: bits/bit-necklace-demo.cc].


1.13.4  Computing  the  cyclic  distance 


A function to compute the cyclic distance between two words [FXT: bits/bitcyclic-dist.h] is:

    static inline ulong bit_cyclic_dist(ulong a, ulong b)
    // Return minimal bitcount of (t ^ b)
    // where t runs through the cyclic rotations of a.
    {
        ulong d = ~0UL;
        ulong t = a;
        do
        {
            ulong z = t ^ b;
            ulong e = bit_count( z );
            if ( e < d )  d = e;
            t = bit_rotate_right(t, 1);
        }
        while ( t!=a );
        return d;
    }


If the arguments are cyclic shifts of each other, then zero is returned. A version for partial words is

    static inline ulong bit_cyclic_dist(ulong a, ulong b, ulong ldn)
    {
        ulong d = ~0UL;
        const ulong m = (~0UL>>(BITS_PER_LONG-ldn));
        b &= m;
        a &= m;
        ulong t = a;
        do
        {
            ulong z = t ^ b;
            ulong e = bit_count( z );
            if ( e < d )  d = e;
            t = bit_rotate_right(t, 1, ldn);
        }
        while ( t!=a );
        return d;
    }


1.13.5  Cyclic  XOR  and  its  inverse 

The functions [FXT: bits/bitcyclic-xor.h]

    static inline ulong bit_cyclic_rxor(ulong x)
    {
        return x ^ bit_rotate_right(x, 1);
    }

and

    static inline ulong bit_cyclic_lxor(ulong x)
    {
        return x ^ bit_rotate_left(x, 1);
    }


return a word whose number of set bits is even. A word and its complement produce the same result.

The inverse functions need no rotation at all, the inverse of bit_cyclic_rxor() is the inverse Gray code
(see section 1.16 on page 41):

    static inline ulong bit_cyclic_inv_rxor(ulong x)
    // Return v so that bit_cyclic_rxor(v) == x.
    {
        return inverse_gray_code(x);
    }

The argument x must have an even number of set bits. If this is the case, the lowest bit of the result is
zero. The complement of the returned value is also an inverse of bit_cyclic_rxor().


The inverse of bit_cyclic_lxor() is the inverse reversed Gray code (see section 1.16.6 on page 45):

    static inline ulong bit_cyclic_inv_lxor(ulong x)
    // Return v so that bit_cyclic_lxor(v) == x.
    {
        return inverse_rev_gray_code(x);
    }

We do not need to mask out the lowest bit because for valid arguments (that have an even number of set
bits) the high bits of the result are zero. This function can be used to solve the quadratic equation
v^2 + v = x in the finite field GF(2^n) when normal bases are used, see section 42.6.2 on page 903.
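The statements about the cyclic XOR and its inverse can be verified numerically. In the sketch below all names are hypothetical and 64-bit words are assumed; inverse_gray_sketch is a stand-in for inverse_gray_code(), written as a prefix XOR (bit k of the result is the XOR of bits k, k+1, ..., 63 of the argument).

```cpp
#include <cstdint>

static inline std::uint64_t rotr1(std::uint64_t x) { return (x >> 1) | (x << 63); }

// x XOR (x rotated right by one): the result has an even number of set bits.
static inline std::uint64_t cyclic_rxor(std::uint64_t x) { return x ^ rotr1(x); }

// Stand-in for the inverse Gray code, computed as a prefix XOR.
static inline std::uint64_t inverse_gray_sketch(std::uint64_t x)
{
    x ^= x >> 1;   x ^= x >> 2;   x ^= x >> 4;
    x ^= x >> 8;   x ^= x >> 16;  x ^= x >> 32;
    return x;
}

// Parity of the number of set bits (1 if odd, 0 if even).
static inline unsigned parity64(std::uint64_t x)
{
    x ^= x >> 32;  x ^= x >> 16;  x ^= x >> 8;
    x ^= x >> 4;   x ^= x >> 2;   x ^= x >> 1;
    return (unsigned)(x & 1);
}
```

The lowest bit of inverse_gray_sketch(x) is the parity of x. If it is zero, then rotating the result right by one equals shifting it right by one, so the cyclic XOR coincides with the ordinary Gray code and the inversion is exact.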


1.14  Reversing  the  bits  of  a word 


The bits of a binary word can efficiently be reversed by a sequence of steps that reverse the order of
certain blocks. For 16-bit words, we need 4 = log2(16) such steps [FXT: bits/revbin-steps-demo.cc]:

1.14.1 Swapping adjacent bit blocks


We need a couple of auxiliary functions given in [FXT: bits/bitswap.h]. Pairs of adjacent bits can be
swapped via

    static inline ulong bit_swap_1(ulong x)
    // Return x with neighbor bits swapped.
    {
    #if  BITS_PER_LONG == 32
        ulong m = 0x55555555UL;
    #else
    #if  BITS_PER_LONG == 64
        ulong m = 0x5555555555555555UL;
    #endif
    #endif
        return ((x & m) << 1) | ((x & (~m)) >> 1);
    }

The 64-bit branch is omitted in the following examples. Adjacent groups of 2 bits are swapped by

    static inline ulong bit_swap_2(ulong x)
    // Return x with groups of 2 bits swapped.
    {
        ulong m = 0x33333333UL;
        return ((x & m) << 2) | ((x & (~m)) >> 2);
    }


Equivalently, 




Chapter  1:  Bit  wizardry 



    static inline ulong bit_swap_4(ulong x)
    // Return x with groups of 4 bits swapped.
    {
        ulong m = 0x0f0f0f0fUL;
        return ((x & m) << 4) | ((x & (~m)) >> 4);
    }

and

    static inline ulong bit_swap_8(ulong x)
    // Return x with groups of 8 bits swapped.
    {
        ulong m = 0x00ff00ffUL;
        return ((x & m) << 8) | ((x & (~m)) >> 8);
    }

When swapping half-words (here for 32-bit architectures)

    static inline ulong bit_swap_16(ulong x)
    // Return x with groups of 16 bits swapped.
    {
        ulong m = 0x0000ffffUL;
        return ((x & m) << 16) | ((x & (m<<16)) >> 16);
    }

we could also use the bit-rotate function from section 1.12 on page 27 or

        return (x << 16) | (x >> 16);

The GCC compiler recognizes that the whole operation is equivalent to a (left or right) word rotation
and indeed emits just a single rotate instruction.
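Chaining the block swaps of this section reverses a word. The following sketch does so for a 16-bit word in log2(16) = 4 steps; the function name revbin16_sketch is made up and the masks are the 16-bit analogues of the ones used above.

```cpp
#include <cstdint>

// Reverse the bits of a 16-bit word by four block-swapping steps.
static inline std::uint16_t revbin16_sketch(std::uint16_t v)
{
    unsigned x = v;
    x = ((x & 0x5555u) << 1) | ((x >> 1) & 0x5555u);  // swap adjacent bits
    x = ((x & 0x3333u) << 2) | ((x >> 2) & 0x3333u);  // swap groups of 2 bits
    x = ((x & 0x0f0fu) << 4) | ((x >> 4) & 0x0f0fu);  // swap groups of 4 bits
    x = ((x & 0x00ffu) << 8) | ((x >> 8) & 0x00ffu);  // swap the two bytes
    return (std::uint16_t)x;
}
```

For the word 0x1234 the intermediate values are 0x2138, 0x84c2, 0x482c, and finally 0x2c48. The order of the four steps does not matter.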


1.14.2  Bit-reversing  binary  words 

The following is a function to reverse the bits of a binary word [FXT: bits/revbin.h]:

    static inline ulong revbin(ulong x)
    // Return x with reversed bit order.
    {
        x = bit_swap_1(x);
        x = bit_swap_2(x);
        x = bit_swap_4(x);
        x = bit_swap_8(x);
        x = bit_swap_16(x);
    #if  BITS_PER_LONG >= 64
        x = bit_swap_32(x);
    #endif
        return x;
    }

The steps after bit_swap_4() correspond to a byte-reverse operation. This operation is just one
assembler instruction for many CPUs. The inline assembler with GCC for AMD64 CPUs is given in [FXT:
bits/bitasm-amd64.h]:

    static inline ulong asm_bswap(ulong x)
    {
        asm ("bswap %0" : "=r" (x) : "0" (x));
        return x;
    }

We use it for byte reversal if available:

    static inline ulong bswap(ulong x)
    // Return word with reversed byte order.
    {
    #ifdef  BITS_USE_ASM
        x = asm_bswap(x);
    #else
        x = bit_swap_8(x);
        x = bit_swap_16(x);
    #if  BITS_PER_LONG >= 64
        x = bit_swap_32(x);
    #endif
    #endif  // def BITS_USE_ASM
        return x;
    }

The function actually used for bit reversal is good for both 32 and 64 bit words:

    static inline ulong revbin(ulong x)
    {
        x = bit_swap_1(x);
        x = bit_swap_2(x);
        x = bit_swap_4(x);
        x = bswap(x);
        return x;
    }

The masks can be generated in the process:

    static inline ulong revbin(ulong x)
    {
        ulong s = BITS_PER_LONG >> 1;
        ulong m = ~0UL >> s;
        while ( s )
        {
            x = ( (x & m) << s ) ^ ( (x & (~m)) >> s );
            s >>= 1;
            m ^= (m<<s);
        }
        return x;
    }

The above function will not always beat the obvious, bit-wise algorithm:

    static inline ulong revbin(ulong x)
    {
        ulong r = 0,  ldn = BITS_PER_LONG;
        while ( ldn-- != 0 )
        {
            r <<= 1;
            r += (x&1);
            x >>= 1;
        }
        return r;
    }

Therefore the function

    static inline ulong revbin(ulong x, ulong ldn)
    // Return word with the ldn least significant bits
    //   (i.e. bit_0 ... bit_{ldn-1})  of x reversed,
    //   the other bits are set to zero.
    {
        return revbin(x) >> (BITS_PER_LONG-ldn);
    }

should only be used if ldn is not too small, else be replaced by the trivial algorithm.
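The variants can be cross-checked against each other. The sketch below restates them for 64-bit words under hypothetical names (revbin_masks, revbin_bitwise, revbin_ldn); the sub-word version assumes 1 <= ldn <= 64 so that the final shift stays below the word length.

```cpp
#include <cstdint>

// Block swaps with the masks generated on the fly.
static inline std::uint64_t revbin_masks(std::uint64_t x)
{
    unsigned s = 32;
    std::uint64_t m = ~(std::uint64_t)0 >> s;
    while (s)
    {
        x = ((x & m) << s) ^ ((x & ~m) >> s);
        s >>= 1;
        m ^= (m << s);            // split every block of ones in two
    }
    return x;
}

// Bit-by-bit reversal of the lowest ldn bits (0 <= ldn <= 64).
static inline std::uint64_t revbin_bitwise(std::uint64_t x, unsigned ldn)
{
    std::uint64_t r = 0;
    while (ldn-- != 0)
    {
        r <<= 1;
        r += (x & 1);
        x >>= 1;
    }
    return r;
}

// Reversal of the lowest ldn bits via the full-word routine.
static inline std::uint64_t revbin_ldn(std::uint64_t x, unsigned ldn)
{
    return revbin_masks(x) >> (64 - ldn);
}
```

The bit-wise loop takes ldn steps, the block-swap version a fixed log2(64) = 6 steps, so the break-even point depends on ldn and is best found by measurement.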

We can use table lookups so that, for example, eight bits are reversed at a time using a 256-byte table.
The routine for full words is

    unsigned char revbin_tab[256];  // reversed 8-bit words
    ulong revbin_t(ulong x)
    {
        ulong r = 0;
        for (ulong k=0; k<BYTES_PER_LONG; ++k)
        {
            r <<= 8;
            r |= revbin_tab[ x & 255 ];
            x >>= 8;
        }
        return r;
    }

The routine can be optimized by unrolling to avoid all branches:

    static inline ulong revbin_t(ulong x)
    {
        ulong r = revbin_tab[ x & 255 ];  x >>= 8;
        r <<= 8;  r |= revbin_tab[ x & 255 ];  x >>= 8;
        r <<= 8;  r |= revbin_tab[ x & 255 ];  x >>= 8;
    #if  BYTES_PER_LONG > 4
        r <<= 8;  r |= revbin_tab[ x & 255 ];  x >>= 8;
        r <<= 8;  r |= revbin_tab[ x & 255 ];  x >>= 8;
        r <<= 8;  r |= revbin_tab[ x & 255 ];  x >>= 8;
        r <<= 8;  r |= revbin_tab[ x & 255 ];  x >>= 8;
    #endif
        r <<= 8;  r |= revbin_tab[ x ];
        return r;
    }

However, reversing the first 2^30 binary words with this routine takes (on a 64-bit machine) longer than
with the routine using the bit_swap_NN() calls, see [FXT: bits/revbin-tab-demo.cc].

1.14.3  Generating  the  bit-reversed  words  in  order 

If the bit-reversed words have to be generated in the (reversed) counting order, there is a significantly
cheaper way to do the update [FXT: bits/revbin-upd.h]:

    static inline ulong revbin_upd(ulong r, ulong h)
    // Let n=2**ldn and h=n/2.
    // Then, with r == revbin(x, ldn) at entry, return revbin(x+1, ldn)
    // Note: routine will hang if called with r the all-ones word
    {
        while ( !((r^=h)&h) )  h >>= 1;
        return r;
    }

Now assume we want to generate the bit-reversed words of all $N = 2^n - 1$ words less than $2^n$. The total
number of branches with the while-loop can be estimated by observing that for half of the updates just
one bit changes, two bits change for a quarter, three bits change for one eighth of all updates, and so on.
So the loop executes fewer than $2N$ times:
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(1.14-1) 
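The estimate can be checked empirically. The following sketch is not part of FXT: it instruments a copy of revbin_upd() with a counter for the loop tests and compares every result against a bit-wise reversal. The type u64 (standing in for ulong) and the helpers revbin_naive() and count_update_tests() are names introduced here for illustration only.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// Bit-wise reference: reverse the ldn lowest bits of x.
static u64 revbin_naive(u64 x, unsigned ldn)
{
    u64 r = 0;
    for (unsigned i = 0; i < ldn; ++i)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Same logic as revbin_upd() in the text; 'tests' counts the loop tests.
static u64 revbin_upd_counted(u64 r, u64 h, u64 &tests)
{
    for (;;)
    {
        ++tests;
        r ^= h;
        if ( r & h )  break;  // this bit went from 0 to 1: done
        h >>= 1;              // bit went from 1 to 0: carry moves on
    }
    return r;
}

// Step through all words less than 2^ldn (ldn >= 1) and
// return the total number of loop tests.
static u64 count_update_tests(unsigned ldn)
{
    const u64 n = u64(1) << ldn,  h = n >> 1;
    u64 r = 0, tests = 0;
    for (u64 x = 0; x + 1 < n; ++x)
    {
        r = revbin_upd_counted(r, h, tests);
        assert( r == revbin_naive(x + 1, ldn) );
    }
    return tests;
}
```

For any ldn the returned count stays below twice the number of updates, in agreement with the estimate.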


For large values of N the following method can be significantly faster if a fast routine is available for the
computation of the least significant bit in a word. The underlying observation is that for a fixed word of
size n there are just n different patterns of bit-changes with incrementing. We generate a lookup table
of the bit-reversed patterns, utab[], an array of BITS_PER_LONG elements:

static inline void make_revbin_upd_tab(ulong ldn)
// Initialize lookup table used by revbin_tupd()
{
    utab[0] = 1UL<<(ldn-1);
    for (ulong k=1; k<ldn; ++k)  utab[k] = utab[k-1] | (utab[k-1]>>1);
}


The change patterns for n = 5 start as

    pattern   reversed pattern
    ....1     1....
    ...11     11...
    ....1     1....
    ..111     111..
    ....1     1....
    ...11     11...
    ....1     1....
    .1111     1111.
    ....1     1....
    ...11     11...

The pattern with x set bits is used for the update of k to k+1 when the lowest zero of k is at position x - 1:
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4 

The  update  routine  can  now  be  implemented  as 
static inline ulong revbin_tupd(ulong r, ulong k)
// Let r==revbin(k, ldn) then
// return revbin(k+1, ldn).
// NOTE 1: need to call make_revbin_upd_tab(ldn) before usage
//         where ldn=log_2(n)
// NOTE 2: different argument structure than revbin_upd()
{
    k = lowest_one_idx(~k);  // lowest zero idx
    r ^= utab[k];
    return r;
}
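A usage sketch for the table-driven update follows. It is self-contained, so the table and both routines are repeated; the bit-scan is replaced by a portable loop (lowest_zero_idx(), a hypothetical helper) where the library would use a machine instruction, and u64 and walk_with_table() are likewise names assumed here.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

static u64 utab[64];  // reversed change patterns

static void make_revbin_upd_tab(unsigned ldn)
{
    utab[0] = u64(1) << (ldn - 1);
    for (unsigned k = 1; k < ldn; ++k)  utab[k] = utab[k-1] | (utab[k-1] >> 1);
}

// Portable replacement for lowest_one_idx(~k); k must not be all ones.
static unsigned lowest_zero_idx(u64 k)
{
    unsigned i = 0;
    while ( k & 1 )  { k >>= 1;  ++i; }
    return i;
}

static u64 revbin_tupd(u64 r, u64 k)
{
    return r ^ utab[ lowest_zero_idx(k) ];
}

static u64 revbin_naive(u64 x, unsigned ldn)
{
    u64 r = 0;
    for (unsigned i = 0; i < ldn; ++i)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Generate revbin(1), ..., revbin(2^ldn - 1) by updates, check each one,
// and return the last word generated.
static u64 walk_with_table(unsigned ldn)
{
    make_revbin_upd_tab(ldn);
    const u64 n = u64(1) << ldn;
    u64 r = 0;
    for (u64 k = 0; k + 1 < n; ++k)
    {
        r = revbin_tupd(r, k);
        assert( r == revbin_naive(k + 1, ldn) );
    }
    return r;
}
```

Note that the update takes the counter k as its second argument, so the caller has to keep k and r in step.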


The revbin-update routines are used for the revbin permutation described in section 2.6.

Figure 1.14-A: Relative performance of the revbin-update and (full) revbin routines. The timing of the
bit-wise update routine is normalized to 1. Values in each column should be compared, smaller values
correspond to faster routines. A column labeled “N bits” gives the timing for reversing the N least
significant bits of a word.


The relative performance of the different revbin routines is shown in figure 1.14-A. As a surprise, the
full-word revbin function is consistently faster than both of the update routines, mainly because the
machine used (see appendix B on page 922) has a byte swap instruction. As the performance of table
lookups is highly machine dependent, your results can be very different.


1.14.4  Alternative  techniques  for  in-order  generation 


The following loop, due to Brent Lehmann [priv. comm.], also generates the bit-reversed words in succession:

ulong n = 32;  // a power of 2
ulong p = 0, s = 0, n2 = 2*n;
do
{
    // here: s is the bit-reversed word
    p += 2;
    s ^= n - (n / (p&-p));
}
while ( p<n2 );

The revbin-increment is branchless but involves a division, which usually is an expensive operation. With
a fast bit-scan function the loop should be replaced by

do
{
    p += 1;
    s ^= n - (n >> (lowest_one_idx(p)+1));
}
while ( p<n );
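As a check of the division-based loop, the sketch below collects the generated words in a vector. The wrapper revbin_sequence(), the checker check_revbin_sequence(), and the type u64 are assumptions made for this example; the expression p & (~p + 1) is the same as p & -p, written so that no unary minus is applied to an unsigned value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// Return revbin(0), revbin(1), ..., revbin(n-1) for n a power of 2.
static std::vector<u64> revbin_sequence(u64 n)
{
    std::vector<u64> out;
    u64 p = 0, s = 0;
    const u64 n2 = 2 * n;
    do
    {
        out.push_back(s);               // here: s is the bit-reversed word
        p += 2;
        s ^= n - (n / (p & (~p + 1)));  // p & (~p + 1) isolates the lowest one of p
    }
    while ( p < n2 );
    return out;
}

// Applying the bit reversal twice must give the identity permutation.
static void check_revbin_sequence(u64 n)
{
    const std::vector<u64> v = revbin_sequence(n);
    assert( v.size() == n );
    for (u64 k = 0; k < n; ++k)  assert( v[ v[k] ] == k );
}
```

For n = 8 the sequence is 0, 4, 2, 6, 1, 5, 3, 7.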

A recursive algorithm for the generation of the bit-reversed words in order is given in [FXT: bits/revbin-c-demo.cc]:

ulong N;
void revbin_rec(ulong f, ulong n)
{
    // visit( f )
    for (ulong m=N>>1; m>n; m>>=1)  revbin_rec(f+m, m);
}

Call revbin_rec(0, 0) to generate all bit-reversed words less than N, where N must be a power of 2.
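The recursion can be tried out with the following sketch, in which the global N and the visit() call are replaced by explicit parameters. This variant, the wrapper revbin_order(), and the type u64 are assumptions of this example, not the form used in the demo program.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// Same recursion as in the text; N is passed along and
// each visited word is appended to 'out'.
static void revbin_rec(u64 f, u64 n, u64 N, std::vector<u64> &out)
{
    out.push_back(f);  // visit( f )
    for (u64 m = N >> 1; m > n; m >>= 1)  revbin_rec(f + m, m, N, out);
}

// All bit-reversed words less than N (a power of 2), in counting order.
static std::vector<u64> revbin_order(u64 N)
{
    std::vector<u64> out;
    revbin_rec(0, 0, N, out);
    assert( out.size() == N );
    return out;
}
```

With N = 8 the words are visited in the order 0, 4, 2, 6, 1, 5, 3, 7.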

A technique to generate all revbin pairs in a pseudo random order is given in section 41.4 on page 873.


1.15  Bit-wise  zip 


The  bit-wise  zip  (bit-zip)  operation  moves  the  bits  in  the  lower  half  to  even  indices  and  the  bits  in  the 
upper  half  to  odd  indices.  For  example,  with  8-bit  words  the  permutation  of  bits  is 

[abcdABCD] |--> [aAbBcCdD]

A straightforward  implementation  is 

ulong bit_zip(ulong a, ulong b)
{
    ulong x = 0;
    ulong m = 1, s = 0;
    for (ulong k=0; k<(BITS_PER_LONG/2); ++k)
    {
        x |= (a & m) << s;
        ++s;
        x |= (b & m) << s;
        m <<= 1;
    }
    return x;
}


Its  inverse  (bit-unzip)  moves  even  indexed  bits  to  the  lower  half-word  and  odd  indexed  bits  to  the  upper 
half-word: 
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void  bit_unzip (ulong  x,  ulong  &a,  ulong  &b) 

{
    a = 0;  b = 0;
    ulong m = 1, s = 0;
    for (ulong k=0; k<(BITS_PER_LONG/2); ++k)
    {
        a |= (x & m) >> s;
        ++s;
        m <<= 1;
        b |= (x & m) >> s;
        m <<= 1;
    }
}
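A round trip through the two straightforward routines can be sketched as follows. The routines are repeated for 64-bit words so that the example is self-contained; u64 and check_zip() are names assumed here.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// Interleave the lower halves: bits of a go to even, bits of b to odd indices.
static u64 bit_zip(u64 a, u64 b)
{
    u64 x = 0, m = 1;
    unsigned s = 0;
    for (unsigned k = 0; k < 32; ++k)
    {
        x |= (a & m) << s;
        ++s;
        x |= (b & m) << s;
        m <<= 1;
    }
    return x;
}

// Inverse: even indexed bits of x go to a, odd indexed bits to b.
static void bit_unzip(u64 x, u64 &a, u64 &b)
{
    a = 0;  b = 0;
    u64 m = 1;
    unsigned s = 0;
    for (unsigned k = 0; k < 32; ++k)
    {
        a |= (x & m) >> s;
        ++s;
        m <<= 1;
        b |= (x & m) >> s;
        m <<= 1;
    }
}

// Unzipping a zipped pair must return the lower halves of both inputs.
static void check_zip(u64 a, u64 b)
{
    u64 a2, b2;
    bit_unzip( bit_zip(a, b), a2, b2 );
    assert( a2 == (a & 0xffffffffULL) );
    assert( b2 == (b & 0xffffffffULL) );
}
```

For example, bit_zip(0xF, 0) gives 0x55 and bit_zip(0, 0xF) gives 0xAA.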


For a faster implementation we will use the butterfly_*()-functions which are defined in [FXT:
bits/bitbutterfly.h] (64-bit version):

static inline ulong butterfly_4(ulong x)
// Swap in each block of 16 bits the two central blocks of 4 bits.
{
    const ulong ml = 0x0f000f000f000f00UL;
    const ulong s = 4;
    const ulong mr = ml >> s;
    const ulong t = ((x & ml) >> s) | ((x & mr) << s);
    x = (x & ~(ml | mr)) | t;
    return x;
}


The following version of the function may look more elegant but is actually slower:

static inline ulong butterfly_4(ulong x)
{
    const ulong m = 0x0ff00ff00ff00ff0UL;
    ulong c = x & m;
    c ^= (c<<4) ^ (c>>4);
    c &= m;
    return x ^ c;
}

The optimized versions of the bit-zip and bit-unzip routines are [FXT: bits/bitzip.h]:

static inline ulong bit_zip(ulong x)
{
#if  BITS_PER_LONG == 64
    x = butterfly_16(x);
#endif
    x = butterfly_8(x);
    x = butterfly_4(x);
    x = butterfly_2(x);
    x = butterfly_1(x);
    return x;
}

and 

static inline ulong bit_unzip(ulong x)
{
    x = butterfly_1(x);
    x = butterfly_2(x);
    x = butterfly_4(x);
    x = butterfly_8(x);
#if  BITS_PER_LONG == 64
    x = butterfly_16(x);
#endif
    return x;
}
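The butterfly-based routines can be compared against a bit-by-bit zip as in the sketch below. Only butterfly_4() is given in the text; the masks used here for the other widths are derived from its comment (swap the two central blocks) and are an assumption of this example, as are the generic helper butterfly(), the reference bit_zip_naive(), and the type u64.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// Swap the bit blocks selected by ml and by (ml >> s).
static u64 butterfly(u64 x, u64 ml, unsigned s)
{
    const u64 mr = ml >> s;
    const u64 t = ((x & ml) >> s) | ((x & mr) << s);
    return (x & ~(ml | mr)) | t;
}

static u64 bit_zip(u64 x)
{
    x = butterfly(x, 0x0000ffff00000000ULL, 16);
    x = butterfly(x, 0x00ff000000ff0000ULL, 8);
    x = butterfly(x, 0x0f000f000f000f00ULL, 4);
    x = butterfly(x, 0x3030303030303030ULL, 2);
    x = butterfly(x, 0x4444444444444444ULL, 1);
    return x;
}

static u64 bit_unzip(u64 x)
{
    x = butterfly(x, 0x4444444444444444ULL, 1);
    x = butterfly(x, 0x3030303030303030ULL, 2);
    x = butterfly(x, 0x0f000f000f000f00ULL, 4);
    x = butterfly(x, 0x00ff000000ff0000ULL, 8);
    x = butterfly(x, 0x0000ffff00000000ULL, 16);
    return x;
}

// Reference: lower half to even indices, upper half to odd indices.
static u64 bit_zip_naive(u64 x)
{
    u64 r = 0;
    for (unsigned k = 0; k < 32; ++k)
    {
        r |= ((x >> k) & 1) << (2*k);
        r |= ((x >> (k + 32)) & 1) << (2*k + 1);
    }
    return r;
}

static void check_butterfly_zip(u64 x)
{
    assert( bit_zip(x) == bit_zip_naive(x) );
    assert( bit_unzip( bit_zip(x) ) == x );
}
```

Each butterfly is its own inverse, which is why bit_unzip() is just bit_zip() with the steps in reverse order.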


Laszlo  Hars  suggests  [priv.  comm.]  the  following  routine  (version  for  32-bit  words),  which  can  be  obtained 
by  making  the  compile-time  constants  explicit: 


static inline uint32 bit_zip(uint32 x)
{
    x = ((x & 0x0000ff00) << 8) | ((x >> 8) & 0x0000ff00) | (x & 0xff0000ff);
    x = ((x & 0x00f000f0) << 4) | ((x >> 4) & 0x00f000f0) | (x & 0xf00ff00f);
    x = ((x & 0x0c0c0c0c) << 2) | ((x >> 2) & 0x0c0c0c0c) | (x & 0xc3c3c3c3);
    x = ((x & 0x22222222) << 1) | ((x >> 1) & 0x22222222) | (x & 0x99999999);
    return x;
}


A bit-zip  version  for  words  whose  upper  half  is  zero  is  (64-bit  version) 


static inline ulong bit_zip0(ulong x)
// Return word with lower half bits in even indices.
{
    x = (x | (x<<16)) & 0x0000ffff0000ffffUL;
    x = (x | (x<<8))  & 0x00ff00ff00ff00ffUL;
    x = (x | (x<<4))  & 0x0f0f0f0f0f0f0f0fUL;
    x = (x | (x<<2))  & 0x3333333333333333UL;
    x = (x | (x<<1))  & 0x5555555555555555UL;
    return x;
}


Its  inverse  is 


static inline ulong bit_unzip0(ulong x)
// Bits at odd positions must be zero.
{
    x = (x | (x>>1))  & 0x3333333333333333UL;
    x = (x | (x>>2))  & 0x0f0f0f0f0f0f0f0fUL;
    x = (x | (x>>4))  & 0x00ff00ff00ff00ffUL;
    x = (x | (x>>8))  & 0x0000ffff0000ffffUL;
    x = (x | (x>>16)) & 0x00000000ffffffffUL;
    return x;
}


The  simple  structure  of  the  routines  suggests  trying  the  following  versions  of  bit-zip  and  its  inverse: 

static inline ulong bit_zip(ulong x)
{
    ulong y = (x >> 32);
    x &= 0xffffffffUL;
    x = (x | (x<<16)) & 0x0000ffff0000ffffUL;
    y = (y | (y<<16)) & 0x0000ffff0000ffffUL;
    x = (x | (x<<8))  & 0x00ff00ff00ff00ffUL;
    y = (y | (y<<8))  & 0x00ff00ff00ff00ffUL;
    x = (x | (x<<4))  & 0x0f0f0f0f0f0f0f0fUL;
    y = (y | (y<<4))  & 0x0f0f0f0f0f0f0f0fUL;
    x = (x | (x<<2))  & 0x3333333333333333UL;
    y = (y | (y<<2))  & 0x3333333333333333UL;
    x = (x | (x<<1))  & 0x5555555555555555UL;
    y = (y | (y<<1))  & 0x5555555555555555UL;
    x |= (y<<1);
    return x;
}

static inline ulong bit_unzip(ulong x)
{
    ulong y = (x >> 1) & 0x5555555555555555UL;
    x &= 0x5555555555555555UL;
    x = (x | (x>>1))  & 0x3333333333333333UL;
    y = (y | (y>>1))  & 0x3333333333333333UL;
    x = (x | (x>>2))  & 0x0f0f0f0f0f0f0f0fUL;
    y = (y | (y>>2))  & 0x0f0f0f0f0f0f0f0fUL;
    x = (x | (x>>4))  & 0x00ff00ff00ff00ffUL;
    y = (y | (y>>4))  & 0x00ff00ff00ff00ffUL;
    x = (x | (x>>8))  & 0x0000ffff0000ffffUL;
    y = (y | (y>>8))  & 0x0000ffff0000ffffUL;
    x = (x | (x>>16)) & 0x00000000ffffffffUL;
    y = (y | (y>>16)) & 0x00000000ffffffffUL;
    x |= (y<<32);
    return x;
}

As  the  statements  involving  the  variables  x and  y are  independent  the  CPU-internal  parallelism  can  be 
used.  However,  these  versions  turn  out  to  be  slightly  slower  than  those  given  before. 

The  following  function  moves  the  bits  of  the  lower  half-word  of  x into  the  even  positions  of  lo  and  the 
bits  of  the  upper  half-word  into  hi  (two  versions  given): 

#define BPLH  (BITS_PER_LONG/2)

static inline void bit_zip2(ulong x, ulong &lo, ulong &hi)
{
#if 1
    x = bit_zip(x);
    lo = x & 0x5555555555555555UL;
    hi = (x>>1) & 0x5555555555555555UL;
#else
    hi = bit_zip0( x >> BPLH );
    lo = bit_zip0( (x << BPLH) >> (BPLH) );
#endif
}

The  inverse  function  is 

static inline ulong bit_unzip2(ulong lo, ulong hi)
// Inverse of bit_zip2(x, lo, hi).
{
#if 1
    return bit_unzip( (hi<<1) | lo );
#else
    return bit_unzip0(lo) | (bit_unzip0(hi) << BPLH);
#endif
}

Functions  that  zip/unzip  the  bits  of  the  lower  half  of  two  words  are 

static inline ulong bit_zip2(ulong x, ulong y)
// 2-word version:
// only the lower half of x and y are merged
{
    return bit_zip( (y<<BPLH) + x );
}

and  (64-bit  version) 

static inline void bit_unzip2(ulong t, ulong &x, ulong &y)
// 2-word version:
// only the lower half of x and y are filled
{
    t = bit_unzip(t);
    y = t >> BPLH;
    x = t & 0x00000000ffffffffUL;
}
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Figure  1.16-A:  Binary  words,  their  Gray  code,  inverse  Gray  code,  and  Gray  codes  of  even  and  odd 


values  (from  left  to  right). 


The Gray code of a binary word can easily be computed by [FXT: bits/graycode.h]:

static inline ulong gray_code(ulong x)  { return x ^ (x>>1); }


Gray codes of consecutive values differ in one bit. Gray codes of values that differ by a power of 2 differ
in two bits. Gray codes of even/odd values have an even/odd number of bits set, respectively. This is
demonstrated in [FXT: bits/gray-demo.cc], whose output is given in figure 1.16-A.


To  produce  a random  value  with  an  even/odd  number  of  bits  set,  set  the  lowest  bit  of  a random  number 
to  0/1,  respectively,  and  return  its  Gray  code. 
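These properties are easy to confirm by brute force. In the sketch below the helper names bit_count(), with_bit_parity(), and check_gray_properties(), and the type u64, are assumptions of this example. The two-bit property is tested for a difference of 4, that is, for a power of 2 greater than one.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

static u64 gray_code(u64 x)  { return x ^ (x >> 1); }

// Number of set bits (portable loop).
static unsigned bit_count(u64 x)
{
    unsigned c = 0;
    while ( x )  { x &= x - 1;  ++c; }
    return c;
}

// Turn an arbitrary word x into a word with an even (odd==0)
// or odd (odd==1) number of set bits, as described in the text.
static u64 with_bit_parity(u64 x, unsigned odd)
{
    x = (x & ~u64(1)) | (odd & 1);  // force the lowest bit
    return gray_code(x);
}

static void check_gray_properties(u64 limit)
{
    for (u64 x = 0; x < limit; ++x)
    {
        // consecutive values: exactly one bit differs
        assert( bit_count( gray_code(x) ^ gray_code(x + 1) ) == 1 );
        // values differing by 4: exactly two bits differ
        assert( bit_count( gray_code(x) ^ gray_code(x + 4) ) == 2 );
        // even/odd value: even/odd number of set bits
        assert( (bit_count( gray_code(x) ) & 1) == (x & 1) );
    }
}
```

In an application the argument x of with_bit_parity() would come from a random number generator.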


Computing  the  inverse  Gray  code  is  slightly  more  expensive.  As  the  Gray  code  is  the  bit-wise  difference 
modulo  2,  we  can  compute  the  inverse  as  bit-wise  sums  modulo  2: 


static inline ulong inverse_gray_code(ulong x)
{
    // VERSION 1 (integration modulo 2):
    ulong h=1, r=0;
    do
    {
        if ( x & 1 )  r^=h;
        x >>= 1;
        h = (h<<1)+1;
    }
    while ( x!=0 );
    return r;
}


For n-bit words, n-fold application of the Gray code gives back the original word. Using the symbol $G$
for the Gray code (operator), we have $G^n = \mathrm{id}$, so $G^{n-1} \circ G = \mathrm{id} = G^{-1} \circ G$. That is, applying the Gray
code computation $n-1$ times gives the inverse Gray code. Thus we can simplify to

    // VERSION 2 (apply graycode BITS_PER_LONG-1 times):
    ulong r = BITS_PER_LONG;
    while ( --r )  x ^= x>>1;
    return x;
Applying the Gray code twice is identical to x^=x>>2;, applying it four times is x^=x>>4;, and the idea
holds for all powers of 2. This leads to the most efficient way to compute the inverse Gray code:

    // VERSION 3 (use: gray ** BITSPERLONG == id):
    x ^= x>>1;  // gray ** 1
    x ^= x>>2;  // gray ** 2
    x ^= x>>4;  // gray ** 4
    x ^= x>>8;  // gray ** 8
    x ^= x>>16;  // gray ** 16
    // here: x = gray**31(input)
    // note: the statements can be reordered at will
#if  BITS_PER_LONG >= 64
    x ^= x>>32;  // for 64bit words
#endif
    return x;
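The versions can be cross-checked as follows. The sketch puts version 1 and version 3 into separate functions for 64-bit words; the function names with suffixes _v1 and _v3, the checker, and the type u64 are assumptions of this example.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

static u64 gray_code(u64 x)  { return x ^ (x >> 1); }

// Version 1: integration modulo 2.
static u64 inverse_gray_code_v1(u64 x)
{
    u64 h = 1, r = 0;
    do
    {
        if ( x & 1 )  r ^= h;  // a set bit toggles all bits at and below it
        x >>= 1;
        h = (h << 1) + 1;
    }
    while ( x != 0 );
    return r;
}

// Version 3: gray**63 as a product of gray**(2^j), j = 0..5.
static u64 inverse_gray_code_v3(u64 x)
{
    x ^= x >> 1;
    x ^= x >> 2;
    x ^= x >> 4;
    x ^= x >> 8;
    x ^= x >> 16;
    x ^= x >> 32;
    return x;
}

static void check_inverse_gray(u64 x)
{
    assert( inverse_gray_code_v1(x) == inverse_gray_code_v3(x) );
    assert( inverse_gray_code_v3( gray_code(x) ) == x );
    assert( gray_code( inverse_gray_code_v3(x) ) == x );
}
```

For example, the inverse Gray code of 3 is 2, and a word with only the highest bit set is mapped to the all-ones word.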


1.16.1  The  parity  of  a binary  word 

The parity of a word is its bit-count modulo 2. The lowest bit of the inverse Gray code of a word contains
the parity of the word. So we can compute the parity as [FXT: bits/parity.h]:

static inline ulong parity(ulong x)
// Return 0 if the number of set bits is even, else 1
{
    return inverse_gray_code(x) & 1;
}

Each  bit  of  the  inverse  Gray  code  contains  the  parity  of  the  partial  input  left  from  it  (including  itself). 

Be warned that the parity flag of many CPUs is the complement of the above. With the x86-architecture
the parity bit also only takes into account the lowest byte. The following routine computes the parity of
a full word [FXT: bits/bitasm-i386.h]:

static inline ulong asm_parity(ulong x)
{
    x ^= (x>>16);
    x ^= (x>>8);
    asm ("addl  $0, %0  \n"
         "setnp %%al   \n"
         "movzx %%al, %0"
         : "=r" (x) : "0" (x) : "eax");
    return x;
}

The equivalent code for the AMD64 CPU is [FXT: bits/bitasm-amd64.h]:

static inline ulong asm_parity(ulong x)
{
    x ^= (x>>32);
    x ^= (x>>16);
    x ^= (x>>8);
    asm ("addq  $0, %0  \n"
         "setnp %%al   \n"
         "movzx %%al, %0"
         : "=r" (x) : "0" (x) : "rax");
    return x;
}


1.16.2  Byte-wise  Gray  code  and  parity 

A byte-wise  Gray  code  can  be  computed  using  (32-bit  version) 

static inline ulong byte_gray_code(ulong x)
// Return the Gray code of bytes in parallel
{
    return x ^ ((x & 0xfefefefe)>>1);
}

Its  inverse  is 

static inline ulong byte_inverse_gray_code(ulong x)
// Return the inverse Gray code of bytes in parallel
{
    x ^= ((x & 0xfefefefeUL)>>1);
    x ^= ((x & 0xfcfcfcfcUL)>>2);
    x ^= ((x & 0xf0f0f0f0UL)>>4);
    return x;
}

And  the  parities  of  all  bytes  can  be  computed  as 

static inline ulong byte_parity(ulong x)
// Return the parities of bytes in parallel
{
    return byte_inverse_gray_code(x) & 0x01010101UL;
}


1.16.3  Incrementing  (counting)  in  Gray  code 
Figure 1.16-B: The Gray code equals the Gray code of doubled value shifted to the right once. Equivalently,
we can separate the lowest bit which equals the parity of the other bits. The last column shows
that the changes with each increment always happen one position left of the rightmost bit.


Let g(k) be the Gray code of a number k. We are interested in efficiently generating g(k+1). We can
implement a fast Gray counter if we use a spare bit to keep track of the parity of the Gray code word,
see figure 1.16-B. The following routine does this [FXT: bits/nextgray.h]:

static inline ulong next_gray2(ulong x)
// With input x==gray_code(2*k) the return is gray_code(2*k+2).
// Let x1 be the word x shifted right once
// and i1 its inverse Gray code.
// Let r1 be the return r shifted right once.
// Then r1 = gray_code(i1+1).
// That is, we have a Gray code counter.
// The argument must have an even number of bits.
{
    x ^= 1;
    x ^= (lowest_one(x) << 1);
    return x;
}


Start with x=0, increment with x=next_gray2(x) and use the words g=x>>1:

ulong x = 0;
for (ulong k=0; k<n2; ++k)
{
    ulong g = x>>1;
    x = next_gray2(x);
    // here: g == gray_code(k);
}


This is shown in [FXT: bits/bit-nextgray-demo.cc]. To start at an arbitrary (Gray code) value g, compute

    x = (g<<1) ^ parity(g)

Then use the statement x=next_gray2(x) for later increments.
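The counter, including the start at an arbitrary value, can be exercised with the sketch below. The parity is computed by XOR-folding, a standard technique chosen here for brevity rather than the inverse Gray code; this choice, the checker check_gray_counter(), and the type u64 are assumptions of this example.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

static u64 gray_code(u64 x)   { return x ^ (x >> 1); }
static u64 lowest_one(u64 x)  { return x & (~x + 1); }

// Parity by XOR-folding the word onto its lowest bit.
static u64 parity(u64 x)
{
    x ^= x >> 32;  x ^= x >> 16;  x ^= x >> 8;
    x ^= x >> 4;   x ^= x >> 2;   x ^= x >> 1;
    return x & 1;
}

// Argument must have an even number of set bits.
static u64 next_gray2(u64 x)
{
    x ^= 1;                      // now the number of set bits is odd, so x != 0
    x ^= (lowest_one(x) << 1);   // toggle the bit left of the lowest one
    return x;
}

// Count 'steps' Gray codes starting from gray_code(k0), checking each word.
// k0 + steps must stay below 2^63 because one bit is used for the parity.
static void check_gray_counter(u64 k0, u64 steps)
{
    const u64 g0 = gray_code(k0);
    u64 x = (g0 << 1) ^ parity(g0);
    for (u64 k = k0; k < k0 + steps; ++k)
    {
        const u64 g = x >> 1;
        assert( g == gray_code(k) );
        x = next_gray2(x);
    }
}
```

The spare bit costs one bit of range but removes the need to recompute the parity in each step.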

If  working  with  a set  whose  elements  are  the  set  bits  in  the  Gray  code,  the  parity  is  the  set  size  k modulo 
2.  Compute  the  increment  as  follows: 


1.  If  k is  even,  then  goto  step  2,  else  goto  step  3. 

2.  If  the  first  element  is  zero,  then  remove  it,  else  prepend  the  element  zero. 

3.  If  the  first  element  equals  the  second  minus  one,  then  remove  the  second  element,  else  insert  at  the 
second  position  the  element  equal  to  the  first  element  plus  one. 


A method  to  decrement  is  obtained  by  simply  swapping  the  actions  for  even  and  odd  parity. 
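The three steps can be sketched with a sorted vector holding the elements of the set. This is an illustration under the assumption that the set is stored in ascending order in a std::vector; the names gray_set_next(), set_to_word(), and check_set_counter() are made up for this example, and the operations are done at the front of the array exactly as the steps are stated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// Increment: s holds the positions of the set bits of a Gray code, ascending.
static void gray_set_next(std::vector<unsigned> &s)
{
    if ( s.size() % 2 == 0 )  // step 2: even parity, toggle element zero
    {
        if ( !s.empty() && s[0] == 0 )  s.erase( s.begin() );
        else                            s.insert( s.begin(), 0u );
    }
    else  // step 3: odd parity, toggle the element (first + 1)
    {
        const unsigned e = s[0] + 1;
        if ( s.size() >= 2 && s[1] == e )  s.erase( s.begin() + 1 );
        else                               s.insert( s.begin() + 1, e );
    }
}

// Binary word whose set bits are the elements of s.
static u64 set_to_word(const std::vector<unsigned> &s)
{
    u64 w = 0;
    for (std::size_t i = 0; i < s.size(); ++i)  w |= u64(1) << s[i];
    return w;
}

// Starting from the empty set, the k-th set must be the Gray code of k.
static void check_set_counter(u64 steps)
{
    std::vector<unsigned> s;
    for (u64 k = 0; k < steps; ++k)
    {
        assert( set_to_word(s) == (k ^ (k >> 1)) );
        gray_set_next(s);
    }
}
```

The first few sets generated are {}, {0}, {0,1}, {1}, {1,2}, {0,1,2}, {0,2}, {2}.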

When working with an array that contains the elements of the set, it is more convenient to do the described
operations at the end of the array. This leads to the (loopless) algorithm for subsets in minimal-change
order given in section 8.2.2 on page 206. Properties of the Gray code are discussed in [127].


1.16.4  The  Thue-Morse  sequence 

The  sequence  of  parities  of  the  binary  words 

011010011001011010010110011010011001011001101001. . . 

is called the Thue-Morse sequence (entry A010060 in [312]). It appears in various seemingly unrelated contexts,
see [8] and section 38.1 on page 726. The sequence can be generated with [FXT: class thue_morse
in bits/thue-morse.h]:

class thue_morse
// Thue-Morse sequence
{
public:
    ulong k_;
    ulong tm_;

public:
    thue_morse(ulong k=0)  { init(k); }
    ~thue_morse()  { ; }

    ulong init(ulong k=0)
    {
        k_ = k;
        tm_ = parity(k_);
        return tm_;
    }

    ulong data()  { return tm_; }

    ulong next()
    {
        ulong x = k_ ^ (k_ + 1);
        ++k_;
        x ^= x>>1;  // highest bit that changed with increment
        x &= 0x5555555555555555UL;  // 64-bit version
        tm_ ^= ( x!=0 );  // change if highest changed bit was at even index
        return tm_;
    }
};

The rate of generation is about 366 M/s (6 cycles per update) [FXT: bits/thue-morse-demo.cc].
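The update step can be tried in isolation. The sketch below applies the same computation as next() in a loop and collects the sequence as a string; the function thue_morse_prefix() and the type u64 are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

// First n terms of the Thue-Morse sequence as a string of '0' and '1'.
static std::string thue_morse_prefix(unsigned n)
{
    std::string out;
    u64 k = 0;
    unsigned tm = 0;  // invariant: tm == parity(k)
    for (unsigned i = 0; i < n; ++i)
    {
        out += char('0' + tm);
        u64 x = k ^ (k + 1);          // ones up to the highest changed bit
        ++k;
        x ^= x >> 1;                  // only the highest changed bit
        x &= 0x5555555555555555ULL;   // nonzero if that bit has an even index
        tm ^= (x != 0);               // an odd number of bits changed
    }
    assert( out.size() == n );
    return out;
}
```

The first 48 terms produced this way agree with the sequence printed at the beginning of this section.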


1.16.5 The Golay-Rudin-Shapiro sequence ‡

The function [FXT: bits/grsnegative.h]

static inline ulong grs_negative_q(ulong x)  { return parity( x & (x>>1) ); }

returns +1 for indices where the Golay-Rudin-Shapiro sequence (or GRS sequence, entry A020985 in
[312]) has the value -1. The algorithm is to count the bit-pairs modulo 2. The pairs may overlap: the
sequence [1111] contains the three bit-pairs [11..], [.11.], and [..11]. The function returns +1 for
x in the sequence

3, 6, 11, 12, 13, 15, 19, 22, 24, 25, 26, 30, 35, 38, 43, 44, 45, 47, 48, 49, 50, 52, 53, ...

Figure 1.16-C: A construction for the Golay-Rudin-Shapiro (GRS) sequence.


This is entry A022155 in [312], see also section 38.3 on page 731. The sequence can be computed by
starting with two ones, and appending the left half and the negated right half of the values so far in each
step, see figure 1.16-C. To compute the successor in the GRS sequence, use

static inline ulong grs_next(ulong k, ulong g)
// With g == grs_negative_q(k), compute grs_negative_q(k+1).
{
    const ulong cm = 0x5555555555555554UL;  // 64-bit version
    ulong h = ~k;  h &= -h;  // == lowest_zero(k);
    g ^= ( ((h&cm) ^ ((k>>1)&h)) != 0 );
    return g;
}


With incrementing k, the lowest run of ones of k is replaced by a one at the lowest zero of k. If the length
of the lowest run is even and ≥ 2 (so that it contains an odd number of bit-pairs) then a change of parity
happens. This is the case if the lowest zero of k is at one of the positions

    bin 0101 0101 0101 0100 == hex 5 5 5 4 == cm

If the position of the lowest zero is adjacent to the next block of ones, another change of parity will occur.
The element of the GRS sequence changes if exactly one of the parity changes takes place.

The update function can be used as shown in [FXT: bits/grs-next-demo.cc]:

ulong n = 65;  // Generate this many values of the sequence.
ulong k0 = 0;  // Start point of the sequence.
ulong g = grs_negative_q(k0);
for (ulong k=k0; k<k0+n; ++k)
{
    // Do something with g here.
    g = grs_next(k, g);
}

The rate of generation is about 347 M/s; direct computation gives a rate of 313 M/s.
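The update can be validated against the direct computation. In the sketch below the parity is obtained by XOR-folding (a choice made here for self-containedness); the collector grs_negative_indices() and the type u64 are names assumed for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

static u64 parity(u64 x)
{
    x ^= x >> 32;  x ^= x >> 16;  x ^= x >> 8;
    x ^= x >> 4;   x ^= x >> 2;   x ^= x >> 1;
    return x & 1;
}

// Direct computation: number of (overlapping) bit-pairs modulo 2.
static u64 grs_negative_q(u64 x)  { return parity( x & (x >> 1) ); }

// Update: with g == grs_negative_q(k), return grs_negative_q(k+1).
static u64 grs_next(u64 k, u64 g)
{
    const u64 cm = 0x5555555555555554ULL;
    u64 h = ~k;  h &= (~h + 1);  // lowest zero of k
    g ^= ( ((h & cm) ^ ((k >> 1) & h)) != 0 );
    return g;
}

// Indices k < limit where the GRS sequence is negative, found via
// updates and checked against the direct computation at every step.
static std::vector<u64> grs_negative_indices(u64 limit)
{
    std::vector<u64> idx;
    u64 g = grs_negative_q(0);
    for (u64 k = 0; k < limit; ++k)
    {
        assert( g == grs_negative_q(k) );
        if ( g )  idx.push_back(k);
        g = grs_next(k, g);
    }
    return idx;
}
```

With limit 23 the indices collected are 3, 6, 11, 12, 13, 15, 19, 22, the start of the list given above.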


1.16.6  The  reversed  Gray  code 


We  define  the  reversed  Gray  code  to  be  the  bit-reversed  word  of  the  Gray  code  of  the  bit-reversed  word. 
That  is, 

rev_gray_code (x)  :=  revbin(  gray_code(  revbin(x)  ) ) 

It turns out that the corresponding functions are identical to the Gray code versions up to the reversed
shift operations (C-language operators ‘>>’ replaced by ‘<<’). So computing the reversed Gray code is as
easy as [FXT: bits/revgraycode.h]:

static inline ulong rev_gray_code(ulong x)  { return x ^ (x<<1); }


Its inverse is

static inline ulong inverse_rev_gray_code(ulong x)
{
    // use: rev_gray ** BITSPERLONG == id:
    x ^= x<<1;  // rev_gray ** 1
    x ^= x<<2;  // rev_gray ** 2
    x ^= x<<4;  // rev_gray ** 4
    x ^= x<<8;  // rev_gray ** 8
    x ^= x<<16;  // rev_gray ** 16
    // here: x = rev_gray**31(input)
    // note: the statements can be reordered at will
#if  BITS_PER_LONG >= 64
    x ^= x<<32;  // for 64bit words
#endif
    return x;
}

Figure 1.16-D: Examples of the Gray code, reversed Gray code, and their inverses with 32-bit words
(the example words are 0xef0f0000, 0x10f0ffff, 0x2000000, and 0xfdffffff).

Some examples with 32-bit words are shown in figure 1.16-D.

Let G and E denote the Gray code and reversed Gray code of a word X, respectively. Write $G^{-1}$
and $E^{-1}$ for their inverses. Then E preserves the lowest bit of X, while G preserves the highest. Also E
preserves the lowest set bit of X, while G preserves the highest. Further, $E^{-1}$ contains at each bit the
parity of all bits of X right from it, including the bit itself. Especially, the word parity can be found in
the highest bit of $E^{-1}$.

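Let $\overline{X}$ denote the complement of $X$, $p$ its parity, and let $S$ be the right shift by one of $G^{-1}$. Then we have

$$ G^{-1} \;\mathrm{XOR}\; E^{-1} = \begin{cases} X & \text{if } p = 0 \\ \overline{X} & \text{otherwise} \end{cases} \qquad (1.16\text{-}1a) $$

$$ S \;\mathrm{XOR}\; E^{-1} = \begin{cases} 0 & \text{if } p = 0 \\ \overline{0} & \text{otherwise} \end{cases} \qquad (1.16\text{-}1b) $$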
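Both relations can be verified numerically for 64-bit words. The sketch below is an illustration only; the function check_identities() and the type u64 are assumptions of this example, and the all-ones word plays the role of the complement of zero.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;  // assumption: stands in for the text's ulong

static u64 inverse_gray_code(u64 x)
{
    x ^= x >> 1;  x ^= x >> 2;   x ^= x >> 4;
    x ^= x >> 8;  x ^= x >> 16;  x ^= x >> 32;
    return x;
}

static u64 inverse_rev_gray_code(u64 x)
{
    x ^= x << 1;  x ^= x << 2;   x ^= x << 4;
    x ^= x << 8;  x ^= x << 16;  x ^= x << 32;
    return x;
}

// Check relations (1.16-1a) and (1.16-1b) for the word x.
static bool check_identities(u64 x)
{
    const u64 gi = inverse_gray_code(x);      // G^{-1}
    const u64 ei = inverse_rev_gray_code(x);  // E^{-1}
    const u64 p  = gi & 1;                    // parity of x
    assert( p == (ei >> 63) );                // parity is also the highest bit of E^{-1}
    const u64 ones = ~u64(0);
    const bool a = ( (gi ^ ei) == (p ? ~x : x) );
    const bool b = ( ((gi >> 1) ^ ei) == (p ? ones : 0) );
    return a && b;
}
```

The reason is visible bit by bit: bit i of the inverse Gray code is the parity of the bits at and above i, bit i of the inverse reversed Gray code is the parity of the bits at and below i, so their sum modulo 2 is the word parity plus bit i.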


We  note  that  taking  the  reversed  Gray  code  of  a binary  word  corresponds  to  multiplication  with  the 
binary  polynomial  x + 1 and  the  inverse  reversed  Gray  code  is  a method  for  fast  exact  division  by  x + 1, 
see  sectioned. 1.6|on  page  826  The  inverse  reversed  Gray  code  can  be  used  to  solve  the  reduced  quadratic 
equation  for  binary  normal  bases,  see  section  42.6. 2|on  page  903 


1.17 Bit sequency ‡

The sequency of a binary word is the number of zero-one transitions in the word. A function to determine
the sequency is [FXT: bits/bitsequency.h]:

static inline ulong bit_sequency(ulong x)  { return bit_count( gray_code(x) ); }


Figure 1.17-A: 6-bit words of prescribed sequency as generated by next_sequency().


The function assumes that all bits to the left of the word are zero and all bits to the right are equal to
the lowest bit, see figure 1.17-A. For example, the sequency of the 8-bit word [00011111] is one. To take
the lowest bit into account, add it to the sequency (then all sequencies are even).


The  minimal  binary  word  with  given  sequency  can  be  computed  as  follows: 

static inline ulong first_sequency(ulong k)
// Return the first (i.e. smallest) word with sequency k,
// e.g. 00..00010101010 (seq 8)
// e.g. 00..00101010101 (seq 9)
// Must have: 0 <= k <= BITS_PER_LONG
{
    return inverse_gray_code( first_comb(k) );
}


A faster version is (32-bit branch only):

    if ( k==0 )  return 0;
    const ulong m = 0xaaaaaaaaUL;
    return m >> (BITS_PER_LONG-k);

The  maximal  binary  word  with  given  sequency  can  be  computed  via 

static inline ulong last_sequency(ulong k)
// Return the last (i.e. biggest) word with sequency k.
{
    return inverse_gray_code( last_comb(k) );
}


The functions first_comb(k) and last_comb(k) return a word with k bits set at the low and high end,
respectively (see section 1.24 on page 62).

For  the  generation  of  all  words  with  a given  sequency,  starting  with  the  smallest,  we  use  a function  that 
computes  the  next  word  with  the  same  sequency: 


static inline ulong next_sequency(ulong x)
{
    x = gray_code(x);
    x = next_colex_comb(x);
    x = inverse_gray_code(x);
    return x;
}


The inverse function, returning the previous word with the same sequency, is

static inline ulong prev_sequency(ulong x)
{
    x = gray_code(x);
    x = prev_colex_comb(x);
    x = inverse_gray_code(x);
    return x;
}

The list of all 6-bit words ordered by sequency is shown in figure 1.17-A. It was created with the program
[FXT: bits/bitsequency-demo.cc].

The  sequency  of  a word  can  be  complemented  as  follows  (32-bit  version): 

static inline ulong complement_sequency(ulong x)
// Return word whose sequency is BITS_PER_LONG - s
// where s is the sequency of x
{
    return x ^ 0xaaaaaaaaUL;
}
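The sequency routines for 32-bit words can be checked together as sketched below. The example assumes 32-bit words via std::uint32_t (named u32 here), implements first_comb(k) as the word with the k lowest bits set in accordance with its description above, and uses a portable bit count; none of these details are taken from the library source.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t u32;  // assumption: 32-bit stand-in for the text's ulong

static u32 gray_code(u32 x)  { return x ^ (x >> 1); }

static u32 inverse_gray_code(u32 x)
{
    x ^= x >> 1;  x ^= x >> 2;  x ^= x >> 4;  x ^= x >> 8;  x ^= x >> 16;
    return x;
}

static unsigned bit_count(u32 x)
{
    unsigned c = 0;
    while ( x )  { x &= x - 1;  ++c; }
    return c;
}

static unsigned bit_sequency(u32 x)  { return bit_count( gray_code(x) ); }

// Word with the k lowest bits set, 0 <= k <= 32.
static u32 first_comb(unsigned k)
{
    return (k >= 32) ? 0xffffffffU : ((u32(1) << k) - 1);
}

// Smallest word with sequency k, fast version from the text.
static u32 first_sequency(unsigned k)
{
    if ( k == 0 )  return 0;
    return 0xaaaaaaaaU >> (32 - k);
}

static u32 complement_sequency(u32 x)  { return x ^ 0xaaaaaaaaU; }

static void check_sequency()
{
    for (unsigned k = 0; k <= 32; ++k)
    {
        const u32 f = first_sequency(k);
        assert( f == inverse_gray_code( first_comb(k) ) );
        assert( bit_sequency(f) == k );
        assert( bit_sequency( complement_sequency(f) ) == 32 - k );
    }
}
```

The complement works because the Gray code is linear and the Gray code of 0xaaaaaaaa is the all-ones word, so every bit of the Gray code of x is flipped.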

1.18 Powers of the Gray code ‡
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Figure  1.18-A:  Powers  of  the  matrices  for  the  Gray  code  (top)  and  the  reversed  Gray  code  (bottom). 


The Gray code is a bit-wise linear transform of a binary word. The $2^k$-th power of the Gray code of x
can be computed as x ^ (x>>s) where $s = 2^k$. The e-th power can be computed as the bit-wise sum of the powers
corresponding to the bits in the exponent. This motivates [FXT: bits/graypower.h]:

static inline ulong gray_pow(ulong x, ulong e)
// Return (gray_code**e)(x)
// gray_pow(x, 1) == gray_code(x)
// gray_pow(x, BITS_PER_LONG-1) == inverse_gray_code(x)
{
    e &= (BITS_PER_LONG-1);  // modulo BITS_PER_LONG
    ulong s = 1;
    while ( e )
    {
        if ( e & 1 )  x ^= x >> s;  // gray ** s
        s <<= 1;
        e >>= 1;
    }
    return x;
}

The Gray code g = [g0, g1, ..., g7] of an 8-bit binary word x = [x0, x1, ..., x7] can be expressed as a
matrix multiplication over GF(2) (dots for zeros):

      g           G          x
    [g0]    [ 11...... ]   [x0]
    [g1]    [ .11..... ]   [x1]
    [g2]    [ ..11.... ]   [x2]
    [g3]  = [ ...11... ]   [x3]
    [g4]    [ ....11.. ]   [x4]
    [g5]    [ .....11. ]   [x5]
    [g6]    [ ......11 ]   [x6]
    [g7]    [ .......1 ]   [x7]

The powers of the Gray code correspond to multiplication with powers of the matrix G, shown in
figure 1.18-A (top). The powers of the inverse Gray code for N-bit words (where N is a power of 2)
can be computed by the relation $G^e G^{N-e} = G^N = id$.


static inline ulong inverse_gray_pow(ulong x, ulong e)
// Return (inverse_gray_code**(e))(x)
//   == (gray_code**(-e))(x)
// inverse_gray_pow(x, 1) == inverse_gray_code(x)
// inverse_gray_pow(x, BITS_PER_LONG-1) == gray_code(x)
{
    return gray_pow(x, -e);
}


The  matrices  corresponding  to  the  powers  of  the  reversed  Gray  code  are  shown  in  figure  1.18-A  (bottom). 
We  just  have  to  reverse  the  shift  operator  in  the  functions: 


static inline ulong rev_gray_pow(ulong x, ulong e)
// Return (rev_gray_code**e)(x)
{
    e &= (BITS_PER_LONG-1);  // modulo BITS_PER_LONG
    ulong s = 1;
    while ( e )
    {
        if ( e & 1 )  x ^= x << s;  // rev_gray ** s
        s <<= 1;
        e >>= 1;
    }
    return x;
}


The  inverse  function  is 

static inline ulong inverse_rev_gray_pow(ulong x, ulong e)
// Return (inverse_rev_gray_code**(e))(x)
{
    return rev_gray_pow(x, -e);
}


1.19 Invertible transforms on words ‡


The functions presented in this section are invertible transforms on binary words. The names are chosen
as ‘some code’, emphasizing the result of the transforms, similar to the convention used with the name
‘Gray code’. The functions are given in [FXT: bits/bittransforms.h].

In the transform (blue code)

static inline ulong blue_code(ulong a)
{
    ulong s = BITS_PER_LONG >> 1;
    ulong m = ~0UL << s;
    do
    {
        a ^= ( (a&m) >> s );
        s >>= 1;
        m ^= (m>>s);
    }
    while ( s );
    return a;
}

the masks ‘m’ are (32-bit binary)

1111111111111111................
11111111........11111111........
1111....1111....1111....1111....
11..11..11..11..11..11..11..11..
1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.


The bit-wise complements of these masks are used in the yellow code

static inline ulong yellow_code(ulong a)
{
    ulong s = BITS_PER_LONG >> 1;
    ulong m = ~0UL >> s;
    do
    {
        a ^= ( (a&m) << s );
        s >>= 1;
        m ^= (m<<s);
    }
    while ( s );
    return a;
}

Both need O(log_2(BITS_PER_LONG)) operations. The blue_code can be used as a fast implementation for
the composition of a binary polynomial with x + 1, see section 40.7.2 on page 845. The yellow code can
also be computed by the statement

revbin( blue_code( revbin(x) ) );

So we could have called it reversed blue code. Note the names ‘blue code’ etc. are ad hoc terminology
and not standard. See section 23.11 on page 486 for the closely related Reed-Muller transform.

  a    blue     #    yellow                             #
  0:  ......   0*   ................................   0*
  1:  .....1   1*   11111111111111111111111111111111  32
  2:  ....11   2    1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.  16
  3:  ....1.   1    .1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1  16
  4:  ...1.1   2    11..11..11..11..11..11..11..11..  16
  5:  ...1..   1    ..11..11..11..11..11..11..11..11  16
  6:  ...11.   2*   .11..11..11..11..11..11..11..11.  16
  7:  ...111   3*   1..11..11..11..11..11..11..11..1  16
  8:  ..1111   4    1...1...1...1...1...1...1...1...   8
  9:  ..111.   3    .111.111.111.111.111.111.111.111  24
 10:  ..11..   2    ..1...1...1...1...1...1...1...1.   8
 11:  ..11.1   3    11.111.111.111.111.111.111.111.1  24
 12:  ..1.1.   2    .1...1...1...1...1...1...1...1..   8
 13:  ..1.11   3    1.111.111.111.111.111.111.111.11  24
 14:  ..1..1   2    111.111.111.111.111.111.111.111.  24
 15:  ..1...   1    ...1...1...1...1...1...1...1...1   8
 16:  .1...1   2    1111....1111....1111....1111....  16
 17:  .1....   1    ....1111....1111....1111....1111  16
 18:  .1..1.   2*   .1.11.1..1.11.1..1.11.1..1.11.1.  16
 19:  .1..11   3*   1.1..1.11.1..1.11.1..1.11.1..1.1  16
 20:  .1.1..   2*   ..1111....1111....1111....1111..  16
 21:  .1.1.1   3*   11....1111....1111....1111....11  16
 22:  .1.111   4    1..1.11.1..1.11.1..1.11.1..1.11.  16
 23:  .1.11.   3    .11.1..1.11.1..1.11.1..1.11.1..1  16
 24:  .1111.   4    .1111....1111....1111....1111...  16
 25:  .11111   5    1....1111....1111....1111....111  16
 26:  .111.1   4    11.1..1.11.1..1.11.1..1.11.1..1.  16
 27:  .111..   3    ..1.11.1..1.11.1..1.11.1..1.11.1  16
 28:  .11.11   4    1.11.1..1.11.1..1.11.1..1.11.1..  16
 29:  .11.1.   3    .1..1.11.1..1.11.1..1.11.1..1.11  16
 30:  .11...   2    ...1111....1111....1111....1111.  16
 31:  .11..1   3    111....1111....1111....1111....1  16

Figure 1.19-A: Blue and yellow transforms of the binary words 0, 1, ..., 31. Bit-counts are shown at
the right of each column. Fixed points are marked with asterisks.


The transforms of the binary words up to 31 are shown in figure 1.19-A, the lists were created with the
program [FXT: bits/bittransforms-blue-demo.cc]. The parity of B(a) is equal to the lowest bit of a. Up
to a = 47 the bit-count varies by ±1 between successive values of B(a); the transition B(47) → B(48)
changes the bit-count by 3. The sequence of the indices a where the bit-count changes by more than one
is

47, 51, 59, 67, 75, 79, 175, 179, 187, 195, 203, 207, 291, 299, 339, 347, 419, 427, ...


The yellow code might be a good candidate for ‘randomization’ of binary words. The blue code maps
any range [0, ..., 2^k − 1] onto itself. Both the blue code and the yellow code are involutions (self-inverse).
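The involution property and the start of the sequence of bit-count jumps can be reproduced with a few lines. This is a minimal sketch, assuming 32-bit words (uint32_t) instead of ulong; the names blue_code32, yellow_code32 and the helper blue_bitcount_jumps are hypothetical and exist only for this illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: 32-bit words (uint32_t) stand in for the library's ulong.
static inline uint32_t blue_code32(uint32_t a)
{
    uint32_t s = 16;
    uint32_t m = ~uint32_t(0) << s;
    while ( s )
    {
        a ^= (a & m) >> s;  // XOR upper half of each block into lower half
        s >>= 1;
        m ^= m >> s;
    }
    return a;
}

static inline uint32_t yellow_code32(uint32_t a)
{
    uint32_t s = 16;
    uint32_t m = ~uint32_t(0) >> s;
    while ( s )
    {
        a ^= (a & m) << s;  // XOR lower half of each block into upper half
        s >>= 1;
        m ^= m << s;
    }
    return a;
}

// Hypothetical helper: all a < limit where the bit-counts of
// B(a) and B(a+1) differ by more than one.
static std::vector<uint32_t> blue_bitcount_jumps(uint32_t limit)
{
    std::vector<uint32_t> jumps;
    for (uint32_t a = 0; a < limit; ++a)
    {
        int c0 = (int)std::bitset<32>(blue_code32(a)).count();
        int c1 = (int)std::bitset<32>(blue_code32(a + 1)).count();
        if ( c1 - c0 > 1 || c0 - c1 > 1 )  jumps.push_back(a);
    }
    return jumps;
}
```

Calling blue_bitcount_jumps(64) yields the three indices 47, 51 and 59, matching the first terms listed above.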

The transforms (red code)

static inline ulong red_code(ulong a)
{
    ulong s = BITS_PER_LONG >> 1;
    ulong m = ~0UL >> s;
    do
    {
        ulong u = a & m;
        ulong v = a ^ u;
        a = v ^ (u<<s);
        a ^= (v>>s);
        s >>= 1;
        m ^= (m<<s);
    }
    while ( s );
    return a;
}

and (green code)

static inline ulong green_code(ulong a)
{
    ulong s = BITS_PER_LONG >> 1;
    ulong m = ~0UL << s;
    do
    {
        ulong u = a & m;
        ulong v = a ^ u;
        a = v ^ (u>>s);
        a ^= (v<<s);
        s >>= 1;
        m ^= (m>>s);
    }
    while ( s );
    return a;
}

use the masks (32-bit binary, shown for the red code; the green code uses their bit-wise complements)

................1111111111111111
........11111111........11111111
....1111....1111....1111....1111
..11..11..11..11..11..11..11..11
.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1

  a    red       #   green                              #
  0:  ........   0   ................................   0
  1:  1.......   1   11111111111111111111111111111111  32
  2:  11......   2   .1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1  16
  3:  .1......   1   1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.  16
  4:  1.1.....   2   ..11..11..11..11..11..11..11..11  16
  5:  ..1.....   1   11..11..11..11..11..11..11..11..  16
  6:  .11.....   2   .11..11..11..11..11..11..11..11.  16
  7:  111.....   3   1..11..11..11..11..11..11..11..1  16
  8:  1111....   4   ...1...1...1...1...1...1...1...1   8
  9:  .111....   3   111.111.111.111.111.111.111.111.  24
 10:  ..11....   2   .1...1...1...1...1...1...1...1..   8
 11:  1.11....   3   1.111.111.111.111.111.111.111.11  24
 12:  .1.1....   2   ..1...1...1...1...1...1...1...1.   8
 13:  11.1....   3   11.111.111.111.111.111.111.111.1  24
 14:  1..1....   2   .111.111.111.111.111.111.111.111  24
 15:  ...1....   1   1...1...1...1...1...1...1...1...   8
 16:  1...1...   2   ....1111....1111....1111....1111  16
 17:  ....1...   1   1111....1111....1111....1111....  16
 18:  .1..1...   2   .1.11.1..1.11.1..1.11.1..1.11.1.  16
 19:  11..1...   3   1.1..1.11.1..1.11.1..1.11.1..1.1  16
 20:  ..1.1...   2   ..1111....1111....1111....1111..  16
 21:  1.1.1...   3   11....1111....1111....1111....11  16
 22:  111.1...   4   .11.1..1.11.1..1.11.1..1.11.1..1  16
 23:  .11.1...   3   1..1.11.1..1.11.1..1.11.1..1.11.  16
 24:  .1111...   4   ...1111....1111....1111....1111.  16
 25:  11111...   5   111....1111....1111....1111....1  16
 26:  1.111...   4   .1..1.11.1..1.11.1..1.11.1..1.11  16
 27:  ..111...   3   1.11.1..1.11.1..1.11.1..1.11.1..  16
 28:  11.11...   4   ..1.11.1..1.11.1..1.11.1..1.11.1  16
 29:  .1.11...   3   11.1..1.11.1..1.11.1..1.11.1..1.  16
 30:  ...11...   2   .1111....1111....1111....1111...  16
 31:  1..11...   3   1....1111....1111....1111....111  16

Figure 1.19-B: Red and green transforms of the binary words 0, 1, ..., 31. Of the 32-bit red codes only
the highest 8 bits are shown, the remaining 24 bits are zero.


The transforms of the binary words up to 31 are shown in figure 1.19-B, which was created with the
program [FXT: bits/bittransforms-red-demo.cc]. The red code can also be computed by the statement

revbin( blue_code( x ) );

and the green code by

blue_code( revbin( x ) );

        i    r    B    Y    R    E

   i    i    r    B    Y    R    E
   r    r    i    R*   E*   B*   Y*
   B    B    E*   i    R*   Y*   r*
   Y    Y    R*   E*   i    r*   B*
   R    R    Y*   r*   B*   E    i
   E    E    B*   Y*   r*   i    R

Figure 1.19-C: Multiplication table for the transforms.


1.19.1  Relations  between  the  transforms 

We write B for the blue code (transform), Y for the yellow code and r for bit-reversal (the revbin-function).
We have the following relations between B and Y:

    $B = Y r Y = r Y r$                 (1.19-1a)
    $Y = B r B = r B r$                 (1.19-1b)
    $r = Y B Y = B Y B$                 (1.19-1c)

As said, B and Y are self-inverse:

    $B^{-1} = B, \quad B B = id$        (1.19-2a)
    $Y^{-1} = Y, \quad Y Y = id$        (1.19-2b)

We write R for the red code, and E for the green code. The red code and the green code are not
involutions (square roots of identity) but third roots of identity:

    $R R R = id, \quad R^{-1} = R R = E$    (1.19-3a)
    $E E E = id, \quad E^{-1} = E E = R$    (1.19-3b)
    $R E = E R = id$                        (1.19-3c)

Figure 1.19-C shows the multiplication table. The R in the third column of the second row says that
r B = R. The letter i is used for identity (id). An asterisk says that $x y \neq y x$.

By construction we have

    $R = r B$                           (1.19-4a)
    $E = r Y$                           (1.19-4b)

Relations between R and E are:

    $R = r E r$                         (1.19-5a)
    $E = r R r$                         (1.19-5b)
    $R = R E R$                         (1.19-5c)
    $E = E R E$                         (1.19-5d)

For the bit-reversal we have

    $r = Y R = R B = B E = E Y$         (1.19-6)

Some products for the transforms are

    $B = R Y = Y E = R B R = E B E$     (1.19-7a)
    $Y = E B = B R = R Y R = E Y E$     (1.19-7b)
    $R = B Y = B E B = Y E Y$           (1.19-7c)
    $E = Y B = B R B = Y R Y$           (1.19-7d)

Some triple products that give the identical transform are

    $id = B Y E = R Y B$                (1.19-8a)
    $id = E B Y = B R Y$                (1.19-8b)
    $id = Y E B = Y B R$                (1.19-8c)

1.19.2  Relations  to  Gray  code  and  reversed  Gray  code 

Write g for the Gray code, then:

    $g B g B = id$                      (1.19-9a)
    $g B g = B$                         (1.19-9b)
    $g^{-1} B g^{-1} = B$               (1.19-9c)
    $g B = B g^{-1}$                    (1.19-9d)

Let $S_k$ be the operator that rotates a word by k bits (bit 0 is moved to position k), then

    $Y S_{+1} Y = g$                    (1.19-10a)
    $Y S_{-1} Y = g^{-1}$               (1.19-10b)
    $Y S_k Y = g^k$                     (1.19-10c)

Shift in the sequency domain is bit-wise derivative in time domain. Relation 1.19-10c together with an
algorithm to generate the cycle leaders of the Gray permutation (section 2.12.1 on page 128) gives a
curious method to generate the binary necklaces whose length is a power of 2, described in section 18.1.6
on page 376. Let e be the operator for the reversed Gray code, then

    $B S_{+1} B = e^{-1}$               (1.19-11a)
    $B S_{-1} B = e$                    (1.19-11b)
    $B S_k B = e^{-k}$                  (1.19-11c)

1.19.3 Fixed points of the blue code ‡

  0 = ......  .......... =   0        16 = .1....  .1...1.... = 272
  1 = .....1  .........1 =   1        17 = .1...1  .1.11.1... = 360
  2 = ....1.  .......11. =   6        18 = .1..1.  .1.....1.. = 260
  3 = ....11  .......111 =   7        19 = .1..11  .1.11111.. = 380
  4 = ...1..  .....1.1.. =  20        20 = .1.1..  .1...1.11. = 278
  5 = ...1.1  .....1..1. =  18        21 = .1.1.1  .1.11.111. = 366
  6 = ...11.  .....1.1.1 =  21        22 = .1.11.  .1......1. = 258
  7 = ...111  .....1..11 =  19        23 = .1.111  .1.1111.1. = 378
  8 = ..1...  ...1111... = 120        24 = .11...  .1...1...1 = 273
  9 = ..1..1  ...11.11.. = 108        25 = .11..1  .1.11.1..1 = 361
 10 = ..1.1.  ...111111. = 126        26 = .11.1.  .1.....1.1 = 261
 11 = ..1.11  ...11.1.1. = 106        27 = .11.11  .1.11111.1 = 381
 12 = ..11..  ...1111..1 = 121        28 = .111..  .1...1.111 = 279
 13 = ..11.1  ...11.11.1 = 109        29 = .111.1  .1.11.1111 = 367
 14 = ..111.  ...1111111 = 127        30 = .1111.  .1......11 = 259
 15 = ..1111  ...11.1.11 = 107        31 = .11111  .1.1111.11 = 379

Figure 1.19-D: The first fixed points of the blue code. The highest bit of all fixed points lies at an even
index. There are 2^{n/2} fixed points with highest bit at index n.

The sequence of fixed points of the blue code is (entry A118666 in [312])

0, 1, 6, 7, 18, 19, 20, 21, 106, 107, 108, 109, 120, 121, 126, 127, 258, 259, ...

If f is a fixed point, then f XOR 1 is also a fixed point. Further, 2 (f XOR (2 f)) is a fixed point. These
facts can be cast into a function that returns a unique fixed point for each argument
[FXT: bits/blue-fixed-points.h]:



static inline ulong blue_fixed_point(ulong s)
{
    if ( 0==s )  return 0;
    ulong f = 1;
    while ( s>1 )
    {
        f ^= (f<<1);
        f <<= 1;
        f |= (s&1);
        s >>= 1;
    }
    return f;
}

The output for the first few arguments is shown in figure 1.19-D. Note that the fixed points are not in
ascending order. The list was created by the program [FXT: bits/bittransforms-blue-fp-demo.cc].

Now write f(x) for the binary polynomial corresponding to f (see chapter 40 on page 822). If f(x) is
a fixed point (that is, B f(x) = f(x + 1) = f(x)), then both (x^2 + x) f(x) and 1 + (x^2 + x) f(x) are
fixed points. The function blue_fixed_point() repeatedly multiplies by x^2 + x and adds one if the
corresponding bit of the argument is set.

For the inverse function, we exploit that polynomial division by x + 1 can be done with the inverse
reversed Gray code (see section 1.16.6 on page 45) if the polynomial is divisible by x + 1:


static inline ulong blue_fixed_point_idx(ulong f)
// Inverse of blue_fixed_point()
{
    ulong s = 1;
    while ( f )
    {
        s <<= 1;
        s ^= (f & 1);
        f >>= 1;
        f = inverse_rev_gray_code(f);  // == bitpol_div(f, 3);
    }
    return s >> 1;
}
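The pair of functions can be exercised as a round trip: every value returned by blue_fixed_point() should be unchanged by the blue code, and blue_fixed_point_idx() should recover the argument. The following is a minimal sketch, assuming 32-bit words; the names with the suffix 32 are chosen for this illustration, and inverse_rev_gray_code32 is a straightforward stand-in for the library routine.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-bit words (uint32_t) stand in for the library's ulong.
static inline uint32_t blue_code32(uint32_t a)
{
    uint32_t s = 16, m = ~uint32_t(0) << s;
    while ( s )  { a ^= (a & m) >> s;  s >>= 1;  m ^= m >> s; }
    return a;
}

// Inverse of x ^ (x<<1): XOR of all left shifts of x.
static inline uint32_t inverse_rev_gray_code32(uint32_t x)
{
    x ^= x << 1;  x ^= x << 2;  x ^= x << 4;  x ^= x << 8;  x ^= x << 16;
    return x;
}

static inline uint32_t blue_fixed_point32(uint32_t s)
{
    if ( 0 == s )  return 0;
    uint32_t f = 1;
    while ( s > 1 )
    {
        f ^= (f << 1);   // multiply by x+1
        f <<= 1;         // multiply by x
        f |= (s & 1);    // add one if the bit of s is set
        s >>= 1;
    }
    return f;
}

static inline uint32_t blue_fixed_point_idx32(uint32_t f)
{
    uint32_t s = 1;
    while ( f )
    {
        s <<= 1;
        s ^= (f & 1);                    // record the constant term
        f >>= 1;                         // divide by x
        f = inverse_rev_gray_code32(f);  // divide by x+1
    }
    return s >> 1;
}
```

The values blue_fixed_point32(2) == 6 and blue_fixed_point32(17) == 360 agree with figure 1.19-D.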


1.19.4  More  transforms  by  symbolic  powering 


The idea of powering a transform (as with the Gray code, see section 1.18 on page 48) can be applied to
the ‘color’-transforms as exemplified for the blue code:

static inline ulong blue_xcode(ulong a, ulong x)
{
    x &= (BITS_PER_LONG-1);  // modulo BITS_PER_LONG
    ulong s = BITS_PER_LONG >> 1;
    ulong m = ~0UL << s;
    while ( s )
    {
        if ( x & 1 )  a ^= ( (a&m) >> s );
        x >>= 1;
        s >>= 1;
        m ^= (m>>s);
    }
    return a;
}

The result is not the power of the blue code, which would be pretty boring as B B = id. The transforms
(and the equivalents for Y, R, and E, see [FXT: bits/bitxtransforms.h]) are more interesting: all relations
between the transforms are still valid, if the symbolic exponent is identical with all terms in the relation.
For example, we had B B = id, now $B^x B^x = id$ is true for all x. Similarly, E E = R now has to be
$E^x E^x = R^x$. That is, we have BITS_PER_LONG different versions of our four transforms that share their
properties with the ‘simple’ versions. Among them are BITS_PER_LONG transforms $B^x$ and $Y^x$ that are
involutions and $E^x$ and $R^x$ that are third roots of the identity: $E^x E^x E^x = R^x R^x R^x = id$.

While not powers of the simple versions, we still have $B^0 = Y^0 = R^0 = E^0 = id$. Further, let e be the
‘exponent’ of all ones and Z be any of the transforms, then $Z^e = Z$. Writing ‘+’ for the XOR operation,
we have $Z^x Z^y = Z^{x+y}$ and so $Z^x Z^y = Z$ whenever x + y = e.


1.19.5  The  building  blocks  of  the  transforms 

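Consider the following transforms on 2-bit words where addition is bit-wise (that is, XOR):

    $id_2\, v = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix}$        (1.19-12a)
    $r_2\, v = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} b \\ a \end{bmatrix}$         (1.19-12b)
    $B_2\, v = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a+b \\ b \end{bmatrix}$       (1.19-12c)
    $Y_2\, v = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a \\ a+b \end{bmatrix}$       (1.19-12d)
    $R_2\, v = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} b \\ a+b \end{bmatrix}$       (1.19-12e)
    $E_2\, v = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a+b \\ a \end{bmatrix}$       (1.19-12f)

It can easily be verified that for these the same relations hold as for id, r, B, Y, R, E. In fact the
‘color-transforms’, bit-reversal, and identity are the transforms obtained as repeated Kronecker-products
of the matrices (see section 23.3 on page 462). The transforms are linear over GF(2):

    $Z(\alpha\, a + \beta\, b) = \alpha\, Z(a) + \beta\, Z(b)$        (1.19-13)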
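The verification for the 2-bit building blocks amounts to a handful of 2×2 matrix products over GF(2). The sketch below is only an illustration: the type Mat2 and the function mul are hypothetical names, and matrices are stored row-wise.

```cpp
#include <array>
#include <cassert>

// A 2x2 matrix over GF(2), stored row-wise as {m00, m01, m10, m11}.
using Mat2 = std::array<int, 4>;

// Matrix product over GF(2): AND is multiplication, XOR is addition.
static inline Mat2 mul(const Mat2 &p, const Mat2 &q)
{
    return Mat2{ (p[0] & q[0]) ^ (p[1] & q[2]),  (p[0] & q[1]) ^ (p[1] & q[3]),
                 (p[2] & q[0]) ^ (p[3] & q[2]),  (p[2] & q[1]) ^ (p[3] & q[3]) };
}

// The building blocks from relations 1.19-12a ... 1.19-12f.
static const Mat2 id2{1, 0,  0, 1};
static const Mat2 r2 {0, 1,  1, 0};
static const Mat2 B2 {1, 1,  0, 1};
static const Mat2 Y2 {1, 0,  1, 1};
static const Mat2 R2 {0, 1,  1, 1};
static const Mat2 E2 {1, 1,  1, 0};
```

For instance, mul(r2, B2) equals R2, which is relation 1.19-4a for the 2-bit case.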


The corresponding version of the bit-reversal is [FXT: bits/revbin.h]:

static inline ulong xrevbin(ulong a, ulong x)
{
    x &= (BITS_PER_LONG-1);  // modulo BITS_PER_LONG
    ulong s = BITS_PER_LONG >> 1;
    ulong m = ~0UL >> s;
    while ( s )
    {
        if ( x & 1 )  a = ( (a & m) << s ) ^ ( (a & (~m)) >> s );
        x >>= 1;
        s >>= 1;
        m ^= (m<<s);
    }
    return a;
}

Then, for example, $R^x = r^x B^x$ (see relation 1.19-4a on page 52). The yellow code is the bit-wise
Reed-Muller transform (described in section 23.11 on page 486) of a binary word. The symbolic powering is
equivalent to selecting individual levels of the transform.


1.20  Scanning  for  zero  bytes 

The following function (32-bit version) determines if any sub-byte of the argument is zero [FXT:
bits/zerobyte.h]:

static inline ulong contains_zero_byte(ulong x)
{
    return ((x-0x01010101UL)^x) & (~x) & 0x80808080UL;
}

It returns zero when x contains no zero-byte and nonzero when it does. The idea is to subtract one from
each of the bytes and then look for bytes where the borrow propagated all the way to the most significant
bit. A simplified version is given in [215, sect.7.1.3, rel.90]:

    return 0x80808080UL & ( x - 0x01010101UL ) & ~x;

To scan for other values than zero (e.g. 0xa5), we can use

    contains_zero_byte( x ^ 0xa5a5a5a5UL )
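Both variants of the test, and the trick for scanning for a nonzero byte value, can be tried on a few words. This is a minimal sketch for 32-bit words; contains_byte32 is a hypothetical helper that builds the XOR mask by replicating the byte, it is not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-bit words (uint32_t) stand in for the library's ulong.
static inline uint32_t contains_zero_byte32(uint32_t x)
{
    return ((x - 0x01010101u) ^ x) & (~x) & 0x80808080u;
}

// Simplified variant as quoted in the text.
static inline uint32_t contains_zero_byte32_v2(uint32_t x)
{
    return 0x80808080u & (x - 0x01010101u) & ~x;
}

// Hypothetical helper: nonzero if some byte of x equals b.
static inline uint32_t contains_byte32(uint32_t x, uint8_t b)
{
    // 0x01010101 * b replicates b into all four bytes
    return contains_zero_byte32(x ^ (0x01010101u * b));
}
```

Only the distinction zero versus nonzero is meaningful in the result: a borrow can set the marker bit of a byte that lies above the lowest zero byte, so the set bits need not pinpoint every zero byte exactly.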

For very long strings and word sizes of 64 or more bits the following function may be a win [FXT:
aux1/bytescan.cc]:

ulong long_strlen(const char *str)
// Return length of string starting at str.
{
    ulong x;
    const char *p = str;

    // Alignment: scan bytes up to word boundary:
    while ( (ulong)p % BYTES_PER_LONG )
    {
        if ( 0 == *p )  return (ulong)(p-str);
        ++p;
    }

    x = *(ulong *)p;
    while ( ! contains_zero_byte(x) )
    {
        p += BYTES_PER_LONG;
        x = *(ulong *)p;
    }

    // now a zero byte is somewhere in x:
    while ( 0 != *p )  { ++p; }

    return (ulong)(p-str);
}


1.21 Inverse and square root modulo 2^n


1.21.1 Computation of the inverse


The inverse modulo 2^n where n is the number of bits in a word can be computed using an iteration (see
section 29.1.5 on page 569) with quadratic convergence. The number to be inverted has to be odd [FXT:
bits/bit2adic.h]:


static inline ulong inv2adic(ulong x)
// Return inverse modulo 2**BITS_PER_LONG
// x must be odd
// The number of correct bits is doubled with each step
// ==> loop is executed prop. log_2(BITS_PER_LONG) times
// precision is 3, 6, 12, 24, 48, 96, ... bits (or better)
{
    if ( 0==(x&1) )  return 0;  // not invertible
    ulong i = x;  // correct to three bits at least
    ulong p;
    do
    {
        p = i * x;
        i *= (2UL - p);
    }
    while ( p!=1 );
    return i;
}

Let m be the modulus (a power of 2), then the computed value i is the inverse of x modulo m:
$i \equiv x^{-1} \bmod m$. It can be used for the exact division: to compute the quotient a/x for a number a that is
known to be divisible by x, simply multiply by i. This works because a = b x (a is divisible by x), so
$a\,i \equiv b\,x\,i \equiv b \bmod m$.

1.21.2 Exact division by C = 2^k ± 1

We use the following relation where Y = 1 − C:

    $\frac{A}{C} = \frac{A}{1-Y} = A\,(1+Y)(1+Y^2)(1+Y^4)(1+Y^8) \cdots (1+Y^{2^n}) \bmod Y^{2^{n+1}}$        (1.21-1)

The relation can be used for efficient exact division over Z by C = 2^k ± 1. For C = 2^k + 1 use

    $\frac{A}{C} = A\,(1-2^k)(1+2^{k\,2})(1+2^{k\,4})(1+2^{k\,8}) \cdots (1+2^{k\,2^u}) \bmod 2^N$        (1.21-2)

where $k\,2^{u+1} \geq N$. For C = 2^k − 1 use (A/C = (−A)/(−C))

    $\frac{A}{C} = -A\,(1+2^k)(1+2^{k\,2})(1+2^{k\,4})(1+2^{k\,8}) \cdots (1+2^{k\,2^u}) \bmod 2^N$        (1.21-3)

The equivalent method for exact division by polynomials (over GF(2)) is given in section 40.1.6 on
page 826.


1.21.3 Computation of the square root

Figure 1.21-A: Examples of the inverse and square root modulo 2^n of x where −9 ≤ x ≤ +9. Where
no inverse or square root is given, it does not exist.

With the inverse square root we choose the start value to match ⌊d/2⌋ + 1 as that guarantees four bits
of initial precision. Moreover, we control which of the two possible values of the inverse square root is
computed. The argument modulo 8 has to be equal to 1.

static inline ulong invsqrt2adic(ulong d)
// Return inverse square root modulo 2**BITS_PER_LONG
// Must have: d==1 mod 8
// The number of correct bits is doubled with each step
// ==> loop is executed prop. log_2(BITS_PER_LONG) times
// precision is 4, 8, 16, 32, 64, ... bits (or better)
{
    if ( 1 != (d&7) )  return 0;  // no inverse sqrt
    // start value: if d == ****10001 ==> x := ****1001
    ulong x = (d >> 1) | 1;
    ulong p, y;
    do
    {
        y = x;
        p = (3 - d * y * y);
        x = (y * p) >> 1;
    }
    while ( x!=y );
    return x;
}

The square root is computed as $d \cdot 1/\sqrt{d}$:

static inline ulong sqrt2adic(ulong d)
// Return square root modulo 2**BITS_PER_LONG
// Must have: d==1 mod 8 or d==4 mod 32, d==16 mod 128
//   ... d==4**k mod 4**(k+3)
// Result undefined if condition does not hold
{
    if ( 0==d )  return 0;
    ulong s = 0;
    while ( 0==(d&1) )  { d >>= 1;  ++s; }
    d *= invsqrt2adic(d);
    d <<= (s>>1);
    return d;
}


Note that the square root modulo 2^n is something completely different from the integer square root in
general. If the argument d is a perfect square, then the result is $\pm\sqrt{d}$. The output of the program [FXT:
bits/bit2adic-demo.cc] is shown in figure 1.21-A. For further information see [213, ex.31, p.213],
[135, chap.6, p.126], and also [208].


1.22 Radix −2 (minus two) representation

The radix −2 representation of a number n is

    $n = \sum_{k=0}^{\infty} t_k\,(-2)^k$        (1.22-1)

where the t_k are zero or one. For integers n the sum is terminating: the highest nonzero t_k is at most
two positions beyond the highest bit of the binary representation of the absolute value of n (with two’s
complement).

1.22.1 Conversion from binary

Figure 1.22-A: Radix −2 representations and their Gray codes. Lines ending in ‘<=N’ indicate that all
values ≤ N occur in the last column up to that point.

A surprisingly simple algorithm to compute the coefficients t_k of the radix −2 representation of a binary
number is [35, item 128] [FXT: bits/negbin.h]:

static inline ulong bin2neg(ulong x)
// binary --> radix(-2)
{
    const ulong m = 0xaaaaaaaaUL;  // 32-bit version
    x += m;
    x ^= m;
    return x;
}

An example:

    14 --> ..1..1. == 16 - 2 == (-2)^4 + (-2)^1

The inverse routine executes the inverse of the two steps in reversed order:

static inline ulong neg2bin(ulong x)
// radix(-2) --> binary
// inverse of bin2neg()
{
    const ulong m = 0xaaaaaaaaUL;  // 32-bit version
    x ^= m;
    x -= m;
    return x;
}
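The conversion can be validated by evaluating the digits of the result as a sum of powers of −2. The following is a minimal sketch for 32-bit words; negbin_value is a hypothetical checker written for this illustration, not a library function.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-bit words (uint32_t) stand in for the library's ulong.
static inline uint32_t bin2neg32(uint32_t x)
{
    const uint32_t m = 0xaaaaaaaau;
    x += m;
    x ^= m;
    return x;
}

static inline uint32_t neg2bin32(uint32_t x)
{
    const uint32_t m = 0xaaaaaaaau;
    x ^= m;
    x -= m;
    return x;
}

// Hypothetical checker: evaluate sum of t_k * (-2)^k for the digits t_k of t.
static inline int64_t negbin_value(uint32_t t)
{
    int64_t v = 0;
    int64_t w = 1;  // w == (-2)^k
    for (int k = 0; k < 32; ++k, w *= -2)
        if ( (t >> k) & 1 )  v += w;
    return v;
}
```

Negative arguments are passed in two's complement, for example bin2neg32(uint32_t(-1)) gives 3, that is (−2) + 1.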


Figure 1.22-A shows the output of the program [FXT: bits/negbin-demo.cc]. The sequence of Gray codes
of the radix −2 representation is a Gray code for the numbers in the range 0, ..., k for the following
values of k (entry A002450 in [312]):

    k = 1, 5, 21, 85, 341, 1365, 5461, 21845, 87381, 349525, 1398101, ..., (4^n − 1)/3


1.22.2 Fixed points of the conversion ‡

   0: ...........     64: ....1......    256: ..1........    320: ..1.1......
   1: ..........1     65: ....1.....1    257: ..1.......1    321: ..1.1.....1
   4: ........1..     68: ....1...1..    260: ..1.....1..    324: ..1.1...1..
   5: ........1.1     69: ....1...1.1    261: ..1.....1.1    325: ..1.1...1.1
  16: ......1....     80: ....1.1....    272: ..1...1....    336: ..1.1.1....
  17: ......1...1     81: ....1.1...1    273: ..1...1...1    337: ..1.1.1...1
  20: ......1.1..     84: ....1.1.1..    276: ..1...1.1..    340: ..1.1.1.1..
  21: ......1.1.1     85: ....1.1.1.1    277: ..1...1.1.1    341: ..1.1.1.1.1

Figure 1.22-B: The fixed points of the conversion and their binary representations (dots denote zeros).


The sequence of fixed points of the conversion starts as

0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85, 256, ...

The binary representations have ones only at even positions (see figure 1.22-B). This is the Moser -
De Bruijn sequence, entry A000695 in [312]. The generating function of the sequence is

    $\frac{1}{1-x} \sum_{j=0}^{\infty} \frac{4^j\, x^{2^j}}{1 + x^{2^j}} = x + 4x^2 + 5x^3 + 16x^4 + 17x^5 + 20x^6 + 21x^7 + 64x^8 + 65x^9 + \ldots$        (1.22-2)

The sequence also appears as exponents in the power series (see also section 38.10.1 on page 750)

    $\prod_{k=0}^{\infty} \left(1 + x^{4^k}\right) = 1 + x + x^4 + x^5 + x^{16} + x^{17} + x^{20} + x^{21} + x^{64} + x^{65} + x^{68} + \ldots$        (1.22-3)

The k-th fixed point is computed by moving all bits of the binary representation of k to position 2 x
where x ≥ 0 is the index of the bit under consideration:

static inline ulong negbin_fixed_point(ulong k)
{
    return bit_zip0(k);
}

The bit-zip function is given in section 1.15 on page 39. The sequence of radix −2 representations of
0, 1, 2, ..., interpreted as binary numbers, is entry A005351 in [312]:

0,1,6,7,4,5,26,27,24,25,30,31,28,29,18,19,16,17,22,23,20,21,106,107,104,105,110,111, ...

The corresponding sequence for the negative numbers −1, −2, −3, ... is entry A005352:

3,2,13,12,15,14,9,8,11,10,53,52,55,54,49,48,51,50,61,60,63,62,57,56,59,58,37,36,39,38, ...

More information about ‘non-standard’ representations of numbers can be found in [213].
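The computation of the k-th fixed point of the conversion can be reproduced without the library's bit-zip routine. The sketch below assumes 32-bit words and uses a simple loop, bit_zip0_32, as a stand-in that moves bit j of its argument to position 2j; this stand-in and the other names are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-bit words (uint32_t) stand in for the library's ulong.
// Loop-based stand-in for bit_zip0(): move bit j of k to position 2*j.
static inline uint32_t bit_zip0_32(uint32_t k)
{
    uint32_t z = 0;
    for (unsigned j = 0; j < 16; ++j)
        z |= ((k >> j) & 1u) << (2 * j);
    return z;
}

static inline uint32_t bin2neg32(uint32_t x)
{
    const uint32_t m = 0xaaaaaaaau;
    x += m;
    x ^= m;
    return x;
}

static inline uint32_t negbin_fixed_point32(uint32_t k)
{
    return bit_zip0_32(k);
}
```

Because such a word has ones only at even positions, adding the mask 0xaaaaaaaa causes no carries, and the subsequent XOR removes the mask again.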


1.22.3 Generating negbin words in order

Figure 1.22-C: Radix −2 representations of the numbers 0 ... +63 (top) and 0 ... −63 (bottom).


A radix −2 representation can be incremented by the function [FXT: bits/negbin.h] (32-bit versions in
what follows):

static inline ulong next_negbin(ulong x)
// With x the radix(-2) representation of n
// return radix(-2) representation of n+1.
{
    const ulong m = 0xaaaaaaaaUL;
    x ^= m;
    ++x;
    x ^= m;
    return x;
}

A version without constants is

    ulong s = x << 1;
    ulong y = x ^ s;
    y += 1;
    s ^= y;
    return s;

Decrementing can be done via

static inline ulong prev_negbin(ulong x)
// With x the radix(-2) representation of n
// return radix(-2) representation of n-1.
{
    const ulong m = 0xaaaaaaaaUL;
    x ^= m;
    --x;
    x ^= m;
    return x;
}

or via

    const ulong m = 0x55555555UL;
    x ^= m;
    ++x;
    x ^= m;
    return x;

The functions are quite fast, about 730 million words per second are generated (3 cycles per increment
or decrement). Figure 1.22-C shows the generated words in forward (top) and backward (bottom) order.
It was created with the program [FXT: bits/negbin2-demo.cc].
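Repeated increments starting from zero should reproduce the sequence of radix −2 representations quoted in section 1.22.2, and the variant without constants should agree with the masked version. This is a minimal sketch for 32-bit words; the function names are chosen for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-bit words (uint32_t) stand in for the library's ulong.
static inline uint32_t next_negbin32(uint32_t x)
{
    const uint32_t m = 0xaaaaaaaau;
    x ^= m;
    ++x;
    x ^= m;
    return x;
}

// Variant without constants, as given in the text.
static inline uint32_t next_negbin32_noconst(uint32_t x)
{
    uint32_t s = x << 1;
    uint32_t y = x ^ s;
    y += 1;
    s ^= y;
    return s;
}

static inline uint32_t prev_negbin32(uint32_t x)
{
    const uint32_t m = 0xaaaaaaaau;
    x ^= m;
    --x;
    x ^= m;
    return x;
}
```

Starting from 0, next_negbin32() produces 1, 6, 7, 4, 5, 26, 27, 24 and prev_negbin32() produces 3, 2, 13, 12.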


1.23  A sparse  signed  binary  representation 
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+32  -8 

-1 

24 

. .11. . . 

.P.M. . . 

24 

= 

+32  -8 

25 

. .11.  .1 

.P.M. .P 

25 

= 

+32  -8 

+ 1 

26 

. .11.1. 

.P.M.P. 

26 

= 

+32  -8 

+2 

27 

. .11.11 

.P. .M.M 

27 

= 

+32  -4 

-1 

28 

. .111. . 

.P. .M. . 

28 

= 

+32  -4 

29 

. .111.1 

.P. .M.P 

29 

= 

+32  -4 

+ 1 

30 

. .1111. 

.P. . .M. 

30 

= 

+32  -2 

31 

. .11111 

.P M 

31 

= 

+32  -1 

32 

.1 

.P 

32 

= 

+32 

Figure  1.23-A:  Sparse  signed  binary  representations  (nonadjacent  form,  NAF).  The  symbols  ‘P’  and  ‘M’ 


are  respectively  used  for  +1  and  — 1,  dots  denote  zeros. 


   0   ........   ........    0   =
   1   .......1   .......P    1   = +1
   2   ......1.   ......P.    2   = +2
   4   .....1..   .....P..    4   = +4
   5   .....1.1   .....P.P    5   = +4  +1
   8   ....1...   ....P...    8   = +8
   9   ....1..1   ....P..P    9   = +8  +1
  10   ....1.1.   ....P.P.   10   = +8  +2
  16   ...1....   ...P....   16   = +16
  17   ...1...1   ...P...P   17   = +16  +1
  18   ...1..1.   ...P..P.   18   = +16  +2
  20   ...1.1..   ...P.P..   20   = +16  +4
  21   ...1.1.1   ...P.P.P   21   = +16  +4  +1
  32   ..1.....   ..P.....   32   = +32
  33   ..1....1   ..P....P   33   = +32  +1
  34   ..1...1.   ..P...P.   34   = +32  +2
  36   ..1..1..   ..P..P..   36   = +32  +4
  37   ..1..1.1   ..P..P.P   37   = +32  +4  +1
  40   ..1.1...   ..P.P...   40   = +32  +8
  41   ..1.1..1   ..P.P..P   41   = +32  +8  +1
  42   ..1.1.1.   ..P.P.P.   42   = +32  +8  +2
  64   .1......   .P......   64   = +64

Figure 1.23-B: The numbers whose negative part in the NAF representation is zero.
An algorithm to compute a representation of a number x as

    x = \sum_{k=0}^{\infty} s_k \cdot 2^k   where   s_k \in \{ -1, 0, +1 \}        (1.23-1)

such that two consecutive digits s_k, s_{k+1} are never simultaneously nonzero is given in [275]. Figure 1.23-A gives the representation of several small numbers. It is the output of [FXT: bits/bin2naf-demo.cc].

We can convert the binary representation of x into a pair of binary numbers that correspond to the positive and negative digits [FXT: bits/bin2naf.h]:

static inline void bin2naf(ulong x, ulong &np, ulong &nm)
// Compute (nonadjacent form, NAF) signed binary representation of x:
// the unique representation of x as
// x=\sum_{k}{d_k*2^k} where d_j \in {-1,0,+1}
// and no two adjacent digits d_j, d_{j+1} are both nonzero.
// np has bits set where d_j==+1
// nm has bits set where d_j==-1
// We have:  x = np - nm
{
    ulong xh = x >> 1;  // x/2
    ulong x3 = x + xh;  // 3*x/2
    ulong c = xh ^ x3;
    np = x3 & c;
    nm = xh & c;
}

Converting back to binary is trivial:

static inline ulong naf2bin(ulong np, ulong nm)  { return ( np - nm ); }

The representation is one example of a nonadjacent form (NAF). A method for the computation of certain nonadjacent forms (w-NAF) is given in [255]. A Gray code for the signed binary words is described in section 14.7 on page 315.

If a binary word contains no consecutive ones, then the negative part of the NAF representation is zero. The sequence of values is [0, 1, 2, 4, 5, 8, 9, 10, 16, ...], entry A003714 in [312], see figure 1.23-B. The numbers are called the Fibbinary numbers.
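The properties stated above (x = np - nm, no two adjacent nonzero digits, and a vanishing negative part exactly for words without consecutive ones) can be checked mechanically. This is a minimal sketch, assuming 64-bit words and inputs below 2^62 so that x + x/2 does not overflow; the names bin2naf64(), is_sparse() and naf_is_nonnegative() are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Split x into the positive digits (np) and the negative digits (nm) of its NAF.
static inline void bin2naf64(uint64_t x, uint64_t &np, uint64_t &nm)
{
    uint64_t xh = x >> 1;   // x/2
    uint64_t x3 = x + xh;   // 3*x/2
    uint64_t c  = xh ^ x3;  // positions of the nonzero digits
    np = x3 & c;
    nm = xh & c;
}

// Hypothetical helper: true if no two adjacent bits of w are both set.
static inline bool is_sparse(uint64_t w)  { return (w & (w >> 1)) == 0; }

// Hypothetical helper: true if the NAF of x has no negative digits.
static inline bool naf_is_nonnegative(uint64_t x)
{
    uint64_t np, nm;
    bin2naf64(x, np, nm);
    return nm == 0;
}
```

For example, x = 27 gives np = 32 and nm = 5, that is 27 = +32 -4 -1, in agreement with figure 1.23-A.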


1.24  Generating  bit  combinations 

1.24.1  Co-lexicographic  (colex)  order 

Given a binary word with k bits set the following routine computes the binary word that is the next combination of k bits in co-lexicographic order. In the co-lexicographic order the reversed sets are sorted, see figure 1.24-A. The method to determine the successor is to determine the lowest block of ones and move its highest bit one position up. Then the rest of the block is moved to the low end of the word [FXT: bits/bitcombcolex.h]:

static inline ulong next_colex_comb(ulong x)
{
    ulong r = x & -x;  // lowest set bit
    x += r;  // replace lowest block by a one left to it
    if ( 0==x )  return 0;  // input was last combination
    ulong z = x & -x;  // first zero beyond lowest block
    z -= r;  // lowest block (cf. lowest_block())
    while ( 0==(z&1) )  { z >>= 1; }  // move block to low end of word
    return  x | (z>>1);  // need one bit less of low block
}

One could replace the while-loop by a bit scan and shift combination. The combinations $\binom{32}{20}$ are generated at a rate of about 142 million per second. The rate is about 120 M/s for the combinations $\binom{32}{12}$, the rate with $\binom{60}{7}$ is 70 M/s, and with $\binom{60}{53}$ it is 160 M/s.
         word    =    set        =  set (reversed)
   1:   ...111   =  { 0, 1, 2 }  =  { 2, 1, 0 }
   2:   ..1.11   =  { 0, 1, 3 }  =  { 3, 1, 0 }
   3:   ..11.1   =  { 0, 2, 3 }  =  { 3, 2, 0 }
   4:   ..111.   =  { 1, 2, 3 }  =  { 3, 2, 1 }
   5:   .1..11   =  { 0, 1, 4 }  =  { 4, 1, 0 }
   6:   .1.1.1   =  { 0, 2, 4 }  =  { 4, 2, 0 }
   7:   .1.11.   =  { 1, 2, 4 }  =  { 4, 2, 1 }
   8:   .11..1   =  { 0, 3, 4 }  =  { 4, 3, 0 }
   9:   .11.1.   =  { 1, 3, 4 }  =  { 4, 3, 1 }
  10:   .111..   =  { 2, 3, 4 }  =  { 4, 3, 2 }
  11:   1...11   =  { 0, 1, 5 }  =  { 5, 1, 0 }
  12:   1..1.1   =  { 0, 2, 5 }  =  { 5, 2, 0 }
  13:   1..11.   =  { 1, 2, 5 }  =  { 5, 2, 1 }
  14:   1.1..1   =  { 0, 3, 5 }  =  { 5, 3, 0 }
  15:   1.1.1.   =  { 1, 3, 5 }  =  { 5, 3, 1 }
  16:   1.11..   =  { 2, 3, 5 }  =  { 5, 3, 2 }
  17:   11...1   =  { 0, 4, 5 }  =  { 5, 4, 0 }
  18:   11..1.   =  { 1, 4, 5 }  =  { 5, 4, 1 }
  19:   11.1..   =  { 2, 4, 5 }  =  { 5, 4, 2 }
  20:   111...   =  { 3, 4, 5 }  =  { 5, 4, 3 }

Figure 1.24-A: Combinations $\binom{6}{3}$ in co-lexicographic order. The reversed sets are sorted.


A variant of the method which involves a division appears in [39, item 175]. The routine given here is due to Doug Moore and Glenn Rhoads.

The  following  routine  computes  the  predecessor  of  a combination: 

static inline ulong prev_colex_comb(ulong x)
// Inverse of next_colex_comb()
{
    x = next_colex_comb( ~x );
    if ( 0!=x )  x = ~x;
    return x;
}

The first and last combination can be computed via

static inline ulong first_comb(ulong k)
// Return the first combination of (i.e. smallest word with) k bits,
// i.e. 00..001111..1 (k low bits set)
// Must have:  0 <= k <= BITS_PER_LONG
{
    ulong t = ~0UL >> ( BITS_PER_LONG - k );
    if ( k==0 )  t = 0;  // shift with BITS_PER_LONG is undefined
    return t;
}

and

static inline ulong last_comb(ulong k, ulong n=BITS_PER_LONG)
// return the last combination of (biggest n-bit word with) k bits
// i.e. 1111..100..00 (k high bits set)
// Must have:  0 <= k <= n <= BITS_PER_LONG
{
    return  first_comb(k) << (n - k);
}

The if-statement in first_comb() is needed because a shift by more than BITS_PER_LONG-1 is undefined by the C-standard, see section 1.1.5 on page 4.


The listing in figure 1.24-A can be created with the program [FXT: bits/bitcombcolex-demo.cc]:

    ulong n = 6,  k = 3;
    ulong last = last_comb(k, n);
    ulong g = first_comb(k);
    ulong gg = 0;
    do
    {
        // visit combination given as word g
        gg = g;
        g = next_colex_comb(g);
    }
    while ( gg!=last );
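The loop above can be packaged into a small self-contained routine that collects the words. This is a sketch under the assumption of 64-bit words; the type name word and the function colex_combinations() are introduced here for illustration and do not appear in FXT.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

static inline word next_colex(word x)
{
    word r = x & -x;                  // lowest set bit
    x += r;                           // replace lowest block by a one left to it
    if ( x == 0 )  return 0;          // input was the last combination
    word z = x & -x;                  // first zero beyond lowest block
    z -= r;                           // lowest block
    while ( (z & 1) == 0 )  z >>= 1;  // move block to low end of word
    return  x | (z >> 1);             // need one bit less of low block
}

// All k-subsets of {0,...,n-1} as words, in colex order; needs 1 <= k <= n < 64.
static std::vector<word> colex_combinations(unsigned n, unsigned k)
{
    const word first = (word(1) << k) - 1;  // k low bits set
    const word last  = first << (n - k);    // k high bits (of n) set
    std::vector<word> out;
    for (word g = first;  ;  g = next_colex(g))
    {
        out.push_back(g);
        if ( g == last )  break;
    }
    return out;
}
```

With n = 6 and k = 3 the routine returns the 20 words of figure 1.24-A; as numbers they are increasing, starting with 7, 11, 13, 14, 19.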


1.24.2  Lexicographic  (lex)  order 


           lex (5, 3)                     colex (5, 2)
         word      set                   word      set
   1:   ..111  =  { 0, 1, 2 }           ...11  =  { 0, 1 }
   2:   .1.11  =  { 0, 1, 3 }           ..1.1  =  { 0, 2 }
   3:   1..11  =  { 0, 1, 4 }           ..11.  =  { 1, 2 }
   4:   .11.1  =  { 0, 2, 3 }           .1..1  =  { 0, 3 }
   5:   1.1.1  =  { 0, 2, 4 }           .1.1.  =  { 1, 3 }
   6:   11..1  =  { 0, 3, 4 }           .11..  =  { 2, 3 }
   7:   .111.  =  { 1, 2, 3 }           1...1  =  { 0, 4 }
   8:   1.11.  =  { 1, 2, 4 }           1..1.  =  { 1, 4 }
   9:   11.1.  =  { 1, 3, 4 }           1.1..  =  { 2, 4 }
  10:   111..  =  { 2, 3, 4 }           11...  =  { 3, 4 }

Figure 1.24-B: Combinations $\binom{5}{3}$ in lexicographic order (left). The sets are sorted. The binary words with lex order are the bit-reversed complements of the words with colex order (right).


The binary words corresponding to combinations $\binom{n}{k}$ in lexicographic order are the bit-reversed complements of the words for the combinations $\binom{n}{n-k}$ in co-lexicographic order, see figure 1.24-B. A more precise term for the order is subset-lex (for sets written with elements in increasing order). The sequence is identical to the delta-set-colex order backwards.


The program [FXT: bits/bitcomblex-demo.cc] shows how to compute the subset-lex sequence efficiently:

    ulong n = 5,  k = 3;
    ulong x = first_comb(n-k);  // first colex (n-k choose n)
    const ulong m = first_comb(n);  // aux mask
    const ulong l = last_comb(k, n);  // last colex
    ulong ct = 0;
    ulong y;
    do
    {
        y = revbin(~x, n) & m;  // lex order
        // visit combination given as word y
        x = next_colex_comb(x);
    }
    while ( y != l );

The bit-reversal routine revbin() is shown in section 1.14 on page 33. Sections 6.2.1 on page 177 and 6.2.2 give iterative algorithms for combinations (represented by arrays) in lex and colex order, respectively.
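The complement-and-reverse construction can be tried out without the FXT headers. The sketch below assumes 64-bit words and replaces revbin() by a simple loop over the n low bits; the names revbin_low() and lex_combinations() are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

static inline word next_colex(word x)
{
    word r = x & -x;
    x += r;
    if ( x == 0 )  return 0;
    word z = (x & -x) - r;
    while ( (z & 1) == 0 )  z >>= 1;
    return  x | (z >> 1);
}

// Reverse the n low bits of x (simple loop, stand-in for revbin()).
static inline word revbin_low(word x, unsigned n)
{
    word r = 0;
    for (unsigned i = 0; i < n; ++i, x >>= 1)  r = (r << 1) | (x & 1);
    return r;
}

// k-subsets of {0,...,n-1} as words in subset-lex order; needs 1 <= k <= n < 64.
static std::vector<word> lex_combinations(unsigned n, unsigned k)
{
    const word m = (word(1) << n) - 1;               // mask of the n low bits
    const word l = ((word(1) << k) - 1) << (n - k);  // last word: k high bits set
    word x = (word(1) << (n - k)) - 1;               // first colex word with n-k bits
    std::vector<word> out;
    word y;
    do
    {
        y = revbin_low(~x & m, n);  // complement within n bits, then reverse
        out.push_back(y);
        x = next_colex(x);
    }
    while ( y != l );
    return out;
}
```

For n = 5 and k = 3 the result is the left column of figure 1.24-B, read as numbers: 7, 11, 19, 13, 21, 25, 14, 22, 26, 28.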


1.24.3  Shifts-order 


   1:   1....      1:   11...      1:   111..      1:   1111.
   2:   .1...      2:   .11..      2:   .111.      2:   .1111
   3:   ..1..      3:   ..11.      3:   ..111      3:   111.1
   4:   ...1.      4:   ...11      4:   11.1.      4:   11.11
   5:   ....1      5:   1.1..      5:   .11.1      5:   1.111
                   6:   .1.1.      6:   11..1
                   7:   ..1.1      7:   1.11.
                   8:   1..1.      8:   .1.11
                   9:   .1..1      9:   1.1.1
                  10:   1...1     10:   1..11

Figure 1.24-C: Combinations $\binom{5}{k}$, for k = 1, 2, 3, 4 in shifts-order.


Figure 1.24-C shows combinations in shifts-order. The order for combinations $\binom{n}{k}$ is obtained from the shifts-order for subsets (section 8.4 on page 208) by discarding all subsets whose number of elements is not k and reversing the list order. The first combination is [1^k 0^{n-k}] and the successor is computed as follows (see figure 1.24-D):
   1:   1111...               18:   .11..11
   2:   .1111..               19:   11..1.1   < S
   3:   ..1111.               20:   11...11   < S-2
   4:   ...1111               21:   1.111..   < S-2
   5:   111.1..   < S         22:   .1.111.
   6:   .111.1.               23:   ..1.111
   7:   ..111.1               24:   1.11.1.   < S
   8:   111..1.   < S         25:   .1.11.1
   9:   .111..1               26:   1.11..1   < S
  10:   111...1   < S         27:   1.1.11.   < S-2
  11:   11.11..   < S-2       28:   .1.1.11
  12:   .11.11.               29:   1.1.1.1   < S
  13:   ..11.11               30:   1.1..11   < S-2
  14:   11.1.1.   < S         31:   1..111.   < S-2
  15:   .11.1.1               32:   .1..111
  16:   11.1..1   < S         33:   1..11.1   < S
  17:   11..11.   < S-2       34:   1..1.11   < S-2
  18:   .11..11               35:   1...111   < S-2

Figure 1.24-D: Updates with combinations $\binom{7}{4}$: simple split 'S', split second 'S-2', easy case unmarked.


1.  Easy  case:  if  the  rightmost  one  is  not  in  position  zero  (least  significant  bit),  then  shift  the  word  to 
the  right  and  return  the  combination. 

2. Finished?: if the combination is the last one ([0^n], [0^{n-1} 1], [1 0^{n-k} 1^{k-1}]), then return zero.

3.  Shift  back:  shift  the  word  to  the  left  such  that  the  leftmost  one  is  in  the  leftmost  position  (this  can 
be  a no-op). 

4.  Simple  split:  if  the  rightmost  one  is  not  the  least  significant  bit,  then  move  it  one  position  to  the 
right  and  return  the  combination. 

5.  Split  second  block:  move  the  rightmost  bit  of  the  second  block  (from  the  right)  of  ones  one  position 
to  the  right  and  attach  the  lowest  block  of  ones  and  return  the  combination. 

An implementation is given in [FXT: bits/bitcombshifts.h]:

class bit_comb_shifts
{
public:
    ulong x_;  // the combination
    ulong s_;  // how far shifted to the right
    ulong n_, k_;  // combinations (n choose k)
    ulong last_;  // last combination

public:
    bit_comb_shifts(ulong n, ulong k)
    {
        n_ = n;  k_ = k;
        first();
    }

    ulong first(ulong n, ulong k)
    {
        s_ = 0;
        x_ = last_comb(k, n);
        if ( k>1 )  last_ = first_comb(k-1) | (1UL<<(n_-1));  // [10000111]
        else        last_ = k;  // [000001] or [000000]
        return x_;
    }

    ulong first()  { return first(n_, k_); }

    ulong next()
    {
        if ( 0==(x_&1) )  // easy case:
        {
            ++s_;
            x_ >>= 1;
            return x_;
        }
        else  // splitting cases:
        {
            if ( x_ == last_ )  return 0;  // combination was last
            x_ <<= s_;  s_ = 0;  // shift back to the left
            ulong b = x_ & -x_;  // lowest bit
            if ( b!=1UL )  // simple split
            {
                x_ -= (b>>1);  // move rightmost bit to the right
                return x_;
            }
            else  // split second block and attach first
            {
                ulong t = low_ones(x_);  // block of ones at lower end
                x_ ^= t;  // remove block
                ulong b2 = x_ & -x_;  // (second) lowest bit
                b2 >>= 1;
                x_ -= b2;  // move bit to the right
                // attach block:
                do  { t<<=1; }  while ( 0==(t&x_) );
                x_ |= (t>>1);
                return x_;
            }
        }
    }
};

The combinations $\binom{32}{20}$ are generated at a rate of about 150 M/s, for the combinations $\binom{32}{12}$ the rate is about 220 M/s [FXT: bits/bitcombshifts-demo.cc]. The rate with the combinations $\binom{60}{7}$ is 415 M/s and with $\binom{60}{53}$ it is 110 M/s. The generation is very fast for the sparse case.
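The five update steps can be exercised with a compact stand-alone version of the generator. The following is a sketch only, assuming 64-bit words and 1 <= k <= n < 64; the struct shifts_order, the helper low_ones_block() and the function shifts_order_list() are names made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

// Block of ones at the low end of x (zero if x is even); stand-in for low_ones().
static inline word low_ones_block(word x)  { return ((x + 1) ^ x) >> 1; }

struct shifts_order
{
    word x, s, last;

    shifts_order(unsigned n, unsigned k)
    {
        s = 0;
        x = ((word(1) << k) - 1) << (n - k);  // [1^k 0^(n-k)]
        if ( k > 1 )  last = ((word(1) << (k - 1)) - 1) | (word(1) << (n - 1));
        else          last = 1;               // [0^(n-1) 1]
    }

    word next()  // returns zero after the last combination
    {
        if ( (x & 1) == 0 )  { ++s;  x >>= 1;  return x; }  // easy case
        if ( x == last )  return 0;                         // finished
        x <<= s;  s = 0;                                    // shift back
        word b = x & -x;                                    // lowest bit
        if ( b != 1 )  { x -= (b >> 1);  return x; }        // simple split
        word t = low_ones_block(x);                         // split second block:
        x ^= t;                                             //   remove low block
        word b2 = (x & -x) >> 1;
        x -= b2;                                            //   move bit to the right
        do  { t <<= 1; }  while ( (t & x) == 0 );           //   attach low block
        x |= (t >> 1);
        return x;
    }
};

static std::vector<word> shifts_order_list(unsigned n, unsigned k)
{
    shifts_order c(n, k);
    std::vector<word> out;
    word w = c.x;
    do  { out.push_back(w); }  while ( (w = c.next()) != 0 );
    return out;
}
```

For n = 5 and k = 3 the list reads 111.., .111., ..111, 11.1., .11.1, 11..1, 1.11., .1.11, 1.1.1, 1..11, matching the third column of figure 1.24-C.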


1.24.4 Minimal-change order ‡

The following routine is due to Doug Moore [FXT: bits/bitcombminchange.h]:

static inline ulong igc_next_minchange_comb(ulong x)
// Return the inverse Gray code of the next combination in minimal-change order.
// Input must be the inverse Gray code of the current combination.
{
    ulong g = rev_gray_code(x);
    ulong i = 2;
    ulong cb;  // ==candidate bits;
    do
    {
        ulong y = (x & ~(i-1)) + i;
        ulong j = lowest_one(y) << 1;
        ulong h = !!(y & j);
        cb = ((j-h) ^ g) & (j-i);
        i = j;
    }
    while ( 0==cb );

    return  x + lowest_one(cb);
}

It  can  be  used  as  suggested  by  the  routine 

static inline ulong next_minchange_comb(ulong x, ulong last)
// Not efficient, just to explain the usage of igc_next_minchange_comb()
// Must have: last==igc_last_comb(k, n)
{
    x = inverse_gray_code(x);
    if ( x==last )  return 0;
    x = igc_next_minchange_comb(x);
    return gray_code(x);
}

The auxiliary function igc_last_comb() is (32-bit version only)

static inline ulong igc_last_comb(ulong k, ulong n)
// Return the (inverse Gray code of the) last combination
// as in igc_next_minchange_comb()
{
    if ( 0==k )  return 0;

    const ulong f = 0xaaaaaaaaUL >> (BITS_PER_LONG-k);  // == first_sequency(k);
    const ulong c = ~0UL >> (BITS_PER_LONG-n);  // == first_comb(n);
    return  c ^ (f>>1);
    // =^= (by Doug Moore)
    // return  ((1UL<<n) - 1) ^ (((1UL<<k) - 1) / 3);
}

Successive combinations differ in exactly two positions. For example, with n = 5 and k = 3:

       x       inverse_gray_code(x)
     ..111        ..1.1     == first_sequency(k)
     .11.1        .1..1
     .111.        .1.11
     .1.11        .11.1
     11..1        1...1
     11.1.        1..11
     111..        1.111
     1.1.1        11..1
     1.11.        11.11
     1..11        111.1     == igc_last_comb(k, n)


The  same  run  of  bit  combinations  would  be  generated  by  going  through  the  Gray  codes  and  omitting  all 
words  where  the  bit-count  is  not  equal  to  k.  The  algorithm  shown  here  is  much  more  efficient. 

For  greater  efficiency  one  may  prefer  code  which  avoids  the  repeated  computation  of  the  inverse  Gray 
code,  for  example: 

    ulong last = igc_last_comb(k, n);
    ulong c,  nc = first_sequency(k);
    do
    {
        c = nc;
        nc = igc_next_minchange_comb(c);
        ulong g = gray_code(c);
        // Here g contains the bit-combination
    }
    while ( c!=last );


        n = 6   k = 2                   n = 6   k = 3                   n = 6   k = 4
   ....11  ....1.  ....1.          ...111  ...1.1  ...1..          ..1111  ..1.1.  ..1...
   ...11.  ...1..  ....1.          ..11.1  ..1..1  ....1.          .11.11  .1..1.  ....1.
   ...1.1  ...11.  ....1.          ..111.  ..1.11  ....1.          .1111.  .1.1..  ....1.
   ..11..  ..1...  ...1..          ..1.11  ..11.1  ...1..          .111.1  .1.11.  ...1..
   ..1.1.  ..11..  ....1.          .11..1  .1...1  ....1.          .1.111  .11.1.  ..1...
   ..1..1  ..111.  ....1.          .11.1.  .1..11  ...1..          11..11  1...1.  ....1.
   .11...  .1....  ..1...          .111..  .1.111  ....1.          11.11.  1..1..  ....1.
   .1.1..  .11...  ...1..          .1.1.1  .11..1  ....1.          11.1.1  1..11.  ....1.
   .1..1.  .111..  ....1.          .1.11.  .11.11  ....1.          1111..  1.1...  ...1..
   .1...1  .1111.  ....1.          .1..11  .111.1  ...1..          111.1.  1.11..  ....1.
   11....  1.....  .1....          11...1  1....1  ....1.          111..1  1.111.  ...1..
   1.1...  11....  ..1...          11..1.  1...11  ...1..          1.1.11  11..1.  ....1.
   1..1..  111...  ...1..          11.1..  1..111  ..1...          1.111.  11.1..  ....1.
   1...1.  1111..  ....1.          111...  1.1111  ....1.          1.11.1  11.11.  ...1..
   1....1  11111.  ....1.          1.1..1  11...1  ....1.          1..111  111.1.  ..1...
                                   1.1.1.  11..11  ...1..
                                   1.11..  11.111  ....1.
                                   1..1.1  111..1  ....1.
                                   1..11.  111.11  ....1.
                                   1...11  1111.1  ...1..

Figure 1.24-E: Minimal-change combinations, their inverse Gray codes, and the differences of the inverse Gray codes. The differences are powers of 2.


The difference of the inverse Gray codes of two successive combinations is always a power of 2, see figure 1.24-E (the listings were created with the program [FXT: bits/bitcombminchange-demo.cc]). With this observation we can derive a different version that checks the pattern of the change:

static inline ulong igc_next_minchange_comb(ulong x)
// Alternative version.
{
    ulong gx = gray_code( x );
    ulong y,  i = 2;  // y is declared here because it is returned after the loop
    do
    {
        y = x + i;
        i <<= 1;
        ulong gy = gray_code( y );
        ulong r = gx ^ gy;
        // Check that change consists of exactly one bit
        // of the new and one bit of the old pattern:
        if ( is_pow_of_2( r & gy ) && is_pow_of_2( r & gx ) )  break;
        // is_pow_of_2(x):=((x & -x) == x) returns 1 also for x==0.
        // But this cannot happen for both tests at the same time
    }
    while ( 1 );
    return y;
}

This version is the fastest: the combinations $\binom{32}{12}$ are generated at a rate of about 96 million per second, the combinations $\binom{32}{20}$ at a rate of about 83 million per second.

Here  is  another  version  which  needs  the  number  of  set  bits  as  a second  parameter: 

static inline ulong igc_next_minchange_comb(ulong x, ulong k)
// Alternative version, uses the fact that the difference
// of two successive x is the smallest possible power of 2.
{
    ulong y,  i = 2;
    do
    {
        y = x + i;
        i <<= 1;
    }
    while ( bit_count( gray_code(y) ) != k );
    return y;
}

The  routine  will  be  fast  if  the  CPU  has  a bit-count  instruction.  The  necessary  modification  for  the 
generation  of  the  previous  combination  is  trivial: 

static inline ulong igc_prev_minchange_comb(ulong x, ulong k)
// Returns the inverse graycode of the previous combination in minimal-change order.
// Input must be the inverse graycode of the current combination.
// With input==first the output is the last for n=BITS_PER_LONG
{
    ulong y,  i = 2;
    do
    {
        y = x - i;
        i <<= 1;
    }
    while ( bit_count( gray_code(y) ) != k );
    return y;
}
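The bit-count variant is easy to run without the FXT headers. The sketch below assumes 64-bit words and 1 <= k <= n < 63; the portable bit_count() loop, the 64-bit alternating constant used for the first word, and the function minchange_combinations() are assumptions of this example. The last word is computed with the closed form attributed to Doug Moore above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

static inline word gray_code(word x)  { return x ^ (x >> 1); }

// Portable bit count (stand-in for bit_count()).
static inline unsigned bit_count(word x)
{
    unsigned c = 0;
    for ( ; x != 0; x &= x - 1)  ++c;
    return c;
}

// Inverse Gray code of the next combination: add the smallest power of 2 (>= 2)
// such that the Gray code of the sum has k set bits.
static inline word igc_next(word x, unsigned k)
{
    word y, i = 2;
    do  { y = x + i;  i <<= 1; }  while ( bit_count(gray_code(y)) != k );
    return y;
}

// k-bit combinations of n-bit words in minimal-change order.
static std::vector<word> minchange_combinations(unsigned n, unsigned k)
{
    const word first = 0xaaaaaaaaaaaaaaaaULL >> (64 - k);  // k alternating bits
    const word last  = ((word(1) << n) - 1) ^ (((word(1) << k) - 1) / 3);
    std::vector<word> out;
    for (word c = first;  ;  c = igc_next(c, k))
    {
        out.push_back(gray_code(c));  // the combination itself
        if ( c == last )  break;
    }
    return out;
}
```

For n = 5 and k = 3 this gives ..111, .11.1, .111., .1.11, 11..1, 11.1., 111.., 1.1.1, 1.11., 1..11; any two successive words differ in exactly two positions.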


1.25  Generating  bit  subsets  of  a given  word 


1.25.1  Counting  order 


To generate all subsets of the set of ones of a binary word we use the sparse counting idea shown in section 1.8.1 on page 20. The implementation is [FXT: class bit_subset in bits/bitsubset.h]:

class bit_subset
{
public:
    ulong u_;  // current subset
    ulong v_;  // the full set

public:
    bit_subset(ulong v) : u_(0), v_(v)  { ; }
    ~bit_subset()  { ; }
    ulong current() const  { return u_; }
    ulong next()  { u_ = (u_ - v_) & v_;  return u_; }
    ulong prev()  { u_ = (u_ - 1 ) & v_;  return u_; }

    ulong first(ulong v)  { v_=v;  u_=0;  return u_; }
    ulong first()  { first(v_);  return u_; }

    ulong last(ulong v)  { v_=v;  u_=v;  return u_; }
    ulong last()  { last(v_);  return u_; }
};

With the word [...11.1.] the following sequence of words is produced by subsequent next()-calls:

    ......1.
    ....1...
    ....1.1.
    ...1....
    ...1..1.
    ...11...
    ...11.1.
    ........

A block of ones at the right will result in the binary counting sequence. About 1.1 billion subsets per second are generated with both next() and prev() [FXT: bits/bitsubset-demo.cc].
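The sparse increment can be tried in isolation. A minimal sketch, assuming 64-bit words; the function subsets_counting() is a name introduced for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

// All subsets of the set bits of v in counting order, starting with the empty set.
static std::vector<word> subsets_counting(word v)
{
    std::vector<word> out;
    word u = 0;
    do
    {
        out.push_back(u);
        u = (u - v) & v;  // sparse increment: the carry skips the zero bits of v
    }
    while ( u != 0 );
    return out;
}
```

For v = [...11.1.] (decimal 26) the result is 0, 2, 8, 10, 16, 18, 24, 26, which is the sequence listed above preceded by the empty set.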


1.25.2  Minimal-change  order 

We use a method to isolate the changing bit from counting order that does not depend on shifting:

    *******0111  =  u
    *******1000  =  u+1
    00000001111  =  (u+1) ^ u
    00000001000  =  ((u+1) ^ u) & (u+1)   <--= bit to change

The method still works if the set bits are separated by any amount of zeros. In fact, we want to find the single bit that changed from 0 to 1. The bits that switched from 0 to 1 in the transition from the word A to B can also be isolated via X=B&~A. The implementation is [FXT: class bit_subset_gray in bits/bitsubset-gray.h]:

class bit_subset_gray
{
public:
    bit_subset S_;
    ulong g_;  // subsets in Gray code order
    ulong h_;  // highest bit in S_.v_;  needed for the prev() method

public:
    bit_subset_gray(ulong v) : S_(v), g_(0), h_(highest_one(v))  { ; }
    ~bit_subset_gray()  { ; }

    ulong current() const  { return g_; }
    ulong next()
    {
        ulong u0 = S_.current();
        if ( u0 == S_.v_ )  return first();
        ulong u1 = S_.next();
        ulong x = ~u0 & u1;
        g_ ^= x;
        return g_;
    }

    ulong first(ulong v)  { S_.first(v);  h_=highest_one(v);  g_=0;  return g_; }
    ulong first()  { S_.first();  g_=0;  return g_; }
    [--snip--]

With the word [...11.1.] the following sequence of words is produced by subsequent next()-calls:

    ......1.
    ....1.1.
    ....1...
    ...11...
    ...11.1.
    ...1..1.
    ...1....
    ........

A block of ones at the right will result in the binary Gray code sequence, see [FXT: bits/bitsubset-gray-demo.cc]. The method prev() computes the previous word in the sequence, note the swapped roles of the variables u0 and u1:

    [--snip--]
    ulong prev()
    {
        ulong u1 = S_.current();
        if ( u1 == 0 )  return last();
        ulong u0 = S_.prev();
        ulong x = ~u0 & u1;
        g_ ^= x;
        return g_;
    }

    ulong last(ulong v)  { S_.last(v);  h_=highest_one(v);  g_=h_;  return g_; }
    ulong last()  { S_.last();  g_=h_;  return g_; }
};

About 365 million subsets per second are generated with both next() and prev().
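The Gray code variant can be reduced to a few lines for experimentation. This sketch assumes 64-bit words and folds the counting-order step into the loop; the function subsets_gray() is hypothetical and not the FXT class.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

// All subsets of the set bits of v in minimal-change order, starting with the empty set.
static std::vector<word> subsets_gray(word v)
{
    std::vector<word> out;
    word u0 = 0, g = 0;
    out.push_back(g);
    while ( u0 != v )
    {
        word u1 = (u0 - v) & v;  // next subset in counting order
        g ^= (~u0 & u1);         // toggle the single bit that went from 0 to 1
        out.push_back(g);
        u0 = u1;
    }
    return out;
}
```

For v = 26 the list is 0, 2, 10, 8, 24, 26, 18, 16; for v = 7 it is the binary Gray code 0, 1, 3, 2, 6, 7, 5, 4.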

1.26  Binary  words  in  lexicographic  order  for  subsets 


1.26.1  Next  and  previous  word  in  lexicographic  order 


   1:   1...   =   8   {0}
   2:   11..   =  12   {0, 1}
   3:   111.   =  14   {0, 1, 2}
   4:   1111   =  15   {0, 1, 2, 3}
   5:   11.1   =  13   {0, 1, 3}
   6:   1.1.   =  10   {0, 2}
   7:   1.11   =  11   {0, 2, 3}
   8:   1..1   =   9   {0, 3}
   9:   .1..   =   4   {1}
  10:   .11.   =   6   {1, 2}
  11:   .111   =   7   {1, 2, 3}
  12:   .1.1   =   5   {1, 3}
  13:   ..1.   =   2   {2}
  14:   ..11   =   3   {2, 3}
  15:   ...1   =   1   {3}

Figure 1.26-A: Binary words corresponding to nonempty subsets of the 4-element set in lexicographic order with respect to subsets. Note the first element of the subsets corresponds to the highest set bit.


   0:  ......  =  0 *     16:  .1...1  = 17       32:  1....1  = 33       48:  11..11  = 51
   1:  .....1  =  1 *     17:  .1..11  = 19       33:  1...11  = 35       49:  11..1.  = 50
   2:  ....11  =  3       18:  .1..1.  = 18 *     34:  1...1.  = 34 *     50:  11.1.1  = 53
   3:  ....1.  =  2       19:  .1.1.1  = 21       35:  1..1.1  = 37       51:  11.111  = 55
   4:  ...1.1  =  5       20:  .1.111  = 23       36:  1..111  = 39       52:  11.11.  = 54
   5:  ...111  =  7       21:  .1.11.  = 22       37:  1..11.  = 38       53:  11.1..  = 52
   6:  ...11.  =  6 *     22:  .1.1..  = 20       38:  1..1..  = 36       54:  111..1  = 57
   7:  ...1..  =  4       23:  .11..1  = 25       39:  1.1..1  = 41       55:  111.11  = 59
   8:  ..1..1  =  9       24:  .11.11  = 27       40:  1.1.11  = 43       56:  111.1.  = 58
   9:  ..1.11  = 11       25:  .11.1.  = 26       41:  1.1.1.  = 42       57:  1111.1  = 61
  10:  ..1.1.  = 10 *     26:  .111.1  = 29       42:  1.11.1  = 45       58:  111111  = 63
  11:  ..11.1  = 13       27:  .11111  = 31       43:  1.1111  = 47       59:  11111.  = 62
  12:  ..1111  = 15       28:  .1111.  = 30       44:  1.111.  = 46       60:  1111..  = 60 *
  13:  ..111.  = 14       29:  .111..  = 28       45:  1.11..  = 44       61:  111...  = 56
  14:  ..11..  = 12       30:  .11...  = 24       46:  1.1...  = 40       62:  11....  = 48
  15:  ..1...  =  8       31:  .1....  = 16       47:  11...1  = 49       63:  1.....  = 32

Figure 1.26-B: Binary words corresponding to the subsets of the 6-element set, as generated by prev_lexrev(). Fixed points are marked with asterisk.


The (bit-reversed) binary words in lexicographic order with respect to the subsets shown in figure 1.26-A can be generated by successive calls to the following function [FXT: bits/bitlex.h]:

static inline ulong next_lexrev(ulong x)
// Return next word in subset-lex order.
{
    ulong x0 = x & -x;  // lowest bit
    if ( 1!=x0 )  // easy case: set bit right of lowest bit
    {
        x0 >>= 1;
        x ^= x0;
        return x;
    }
    else  // lowest bit at word end
    {
        x ^= x0;  // clear lowest bit
        x0 = x & -x;  // new lowest bit ...
        x0 >>= 1;  x -= x0;  // ... is moved
        return x;
    }
}

The bit-reversed representation was chosen because the isolation of the lowest bit is often cheaper than the same operation on the highest bit. Starting with a one-bit word at position n-1, we generate the 2^n - 1 nonempty subsets of the word of n ones. The function is used as follows [FXT: bits/bitlex-demo.cc]:

    ulong n = 4;  // n-bit binary words
    ulong x = 1UL<<(n-1);  // first subset
    do
    {
        // visit word x
    }
    while ( (x=next_lexrev(x)) );

The following function goes backward:

static inline ulong prev_lexrev(ulong x)
// Return previous word in subset-lex order.
{
    ulong x0 = x & -x;  // lowest bit
    if ( x & (x0<<1) )  // easy case: next higher bit is set
    {
        x ^= x0;  // clear lowest bit
        return x;
    }
    else
    {
        x += x0;  // move lowest bit to the left
        x |= 1;  // set rightmost bit
        return x;
    }
}

The sequence of all n-bit words is generated by 2^n calls to prev_lexrev(), starting with zero. The words corresponding to subsets of the 6-element set are shown in figure 1.26-B. The sequence [1, 3, 2, 5, 7, 6, 4, 9, ...] in the right column is entry A108918 in [312].

The rate of generation using next() is about 274 million per second and about 253 million per second with prev(). An equivalent routine for arrays is given in section 8.1.2 on page 203. The routines are useful for a special version of fast Walsh transforms described in section 23.5.3 on page 472.
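Both directions can be checked against figures 1.26-A and 1.26-B with a short stand-alone program fragment. This is a sketch assuming 64-bit words; the type name word and the collecting function lexrev_words() are introduced for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

static inline word next_lexrev(word x)
{
    word x0 = x & -x;                       // lowest bit
    if ( x0 != 1 )  return x ^ (x0 >> 1);   // easy case: set bit right of lowest bit
    x ^= x0;                                // clear lowest bit
    x0 = (x & -x) >> 1;                     // new lowest bit, moved one to the right
    return x - x0;
}

static inline word prev_lexrev(word x)
{
    word x0 = x & -x;                       // lowest bit
    if ( x & (x0 << 1) )  return x ^ x0;    // easy case: clear lowest bit
    return (x + x0) | 1;                    // move lowest bit left, set rightmost bit
}

// Words of the nonempty subsets of an n-element set in subset-lex order, 1 <= n < 64.
static std::vector<word> lexrev_words(unsigned n)
{
    std::vector<word> out;
    word x = word(1) << (n - 1);
    do  { out.push_back(x); }  while ( (x = next_lexrev(x)) != 0 );
    return out;
}
```

For n = 4 the words are 8, 12, 14, 15, 13, 10, 11, 9, 4, 6, 7, 5, 2, 3, 1, as in figure 1.26-A. Starting from zero, prev_lexrev() yields 1, 3, 2, 5, 7, 6, 4, 9.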


1.26.2  Conversion  between  binary  and  lex-ordered  words 

A little contemplation on the structure of the binary words in lexicographic order leads to the routine that allows random access to the k-th lex-rev word (unrank algorithm) [FXT: bits/bitlex.h]:

static inline ulong negidx2lexrev(ulong k)
{
    ulong z = 0;
    ulong h = highest_one(k);
    while ( k )
    {
        while ( 0==(h&k) )  h >>= 1;
        z ^= h;
        ++k;
        k &= h - 1;
    }
    return z;
}

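Let the inverse function be T(x), then we have T(0) = 0 and, with h(x) being the highest power of 2 not greater than x,

    T(x) = h(x) - 1 + \begin{cases} T(x - h(x)) & \text{if } x - h(x) \neq 0 \\ h(x) & \text{otherwise} \end{cases}        (1.26-1)

For example, T(4) = 4 - 1 + 4 = 7 and T(6) = 4 - 1 + T(2) = 3 + 3 = 6, in agreement with figure 1.26-B.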
The ranking algorithm starts with the lowest bit:

static inline ulong lexrev2negidx(ulong x)
{
    if ( 0==x )  return 0;
    ulong h = x & -x;  // lowest bit
    ulong r = (h-1);
    while ( x^=h )
    {
        r += (h-1);
        h = x & -x;  // next higher bit
    }
    r += h;  // highest bit
    return r;
}
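That the two routines are mutually inverse can be confirmed by a round trip over a range of indices. The sketch assumes 64-bit words and uses a simple loop for highest_one(); all names follow the text except the type word, which is an assumption of this example.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

// Highest set bit of x as a word, zero for x == 0 (simple stand-in).
static inline word highest_one(word x)
{
    if ( x == 0 )  return 0;
    word h = 1;
    while ( x >>= 1 )  h <<= 1;
    return h;
}

// Unrank: the k-th word of the sequence generated by prev_lexrev() from zero.
static inline word negidx2lexrev(word k)
{
    word z = 0;
    word h = highest_one(k);
    while ( k )
    {
        while ( 0 == (h & k) )  h >>= 1;
        z ^= h;
        ++k;
        k &= h - 1;
    }
    return z;
}

// Rank: index of the word x in that sequence.
static inline word lexrev2negidx(word x)
{
    if ( 0 == x )  return 0;
    word h = x & -x;        // lowest bit
    word r = h - 1;
    while ( x ^= h )
    {
        r += h - 1;
        h = x & -x;         // next higher bit
    }
    r += h;                 // highest bit
    return r;
}
```

The first values of negidx2lexrev(k) for k = 0, 1, 2, ... are 0, 1, 3, 2, 5, 7, 6, 4, 9, matching the value column of figure 1.26-B.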


1.26.3 Minimal decompositions into terms 2^k - 1 ‡

Figure 1.26-C: Binary words in subset-lex order and their bit counts (left columns). The least number of terms of the form 2^k - 1 needed in the sum x = \sum_k (2^k - 1) (right columns) equals the bit count.

The least number of terms needed in the sum x = \sum_k (2^k - 1) equals the number of bits of the lex-word as shown in figure 1.26-C. The number can be computed as

    c = bit_count( negidx2lexrev( x ) );

Alternatively, we can subtract the greatest integer of the form 2^k - 1 until x is zero and count the number of subtractions. The sequence of these numbers is entry A100661 in [312]:

    1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 3, 4, 3, 2, 1, 2, 3, 2, 3, 4, 3, 2, 3, 4, 3, 4, 5, 4, 3, 2, 1, 2, 3, 2, 3, ...
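The two ways of computing the number of terms can be compared directly. The following sketch assumes 64-bit words; greedy_terms() is a name chosen here for the repeated-subtraction method, and the helpers are simple stand-ins for the FXT routines of the same name.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t word;  // stand-in for ulong (assumption of this sketch)

static inline unsigned bit_count(word x)
{
    unsigned c = 0;
    for ( ; x != 0; x &= x - 1)  ++c;
    return c;
}

static inline word highest_one(word x)
{
    if ( x == 0 )  return 0;
    word h = 1;
    while ( x >>= 1 )  h <<= 1;
    return h;
}

static inline word negidx2lexrev(word k)
{
    word z = 0, h = highest_one(k);
    while ( k )
    {
        while ( 0 == (h & k) )  h >>= 1;
        z ^= h;
        ++k;
        k &= h - 1;
    }
    return z;
}

// Count the subtractions of the greatest 2^k - 1 not exceeding x until x is zero.
static inline unsigned greedy_terms(word x)
{
    unsigned c = 0;
    while ( x != 0 )
    {
        word t = 1;
        while ( 2*t + 1 <= x )  t = 2*t + 1;  // greatest 2^k - 1 not exceeding x
        x -= t;
        ++c;
    }
    return c;
}
```

For x = 12 the greedy method subtracts 7, 3, 1, 1, giving four terms, and negidx2lexrev(12) = 15 has four set bits.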


The following function can be used to compute the sequence:

void S(ulong f, ulong n)  // A100661
{
    static int s = 0;
    ++s;
    cout << s << ",";
    for (ulong m=1; m<n; m<<=1)  S(f+m, m);
    --s;
    cout << s << ",";
}
If called with arguments f = 0 and n = 2^k, it prints the first 2^{k+1} - 1 numbers of the sequence followed by a zero. A generating function of the sequence is given by

    Z(x) = \frac{-1 + 2\,(1-x) \prod_{n=1}^{\infty} \left( 1 + x^{2^n - 1} \right)}{(1-x)^2}        (1.26-2)
         = 1 + 2x + x^2 + 2x^3 + 3x^4 + 2x^5 + x^6 + 2x^7 + 3x^8 + 2x^9 + 3x^{10} + 4x^{11} + 3x^{12} + 2x^{13} + \ldots


1.26.4 The sequence of fixed points ‡

Figure 1.26-D: Fixed points of the binary to lex-rev conversion.


The  sequence  of  fixed  points  of  the  conversion  to  and  from  indices  starts  as 

0,  1,  6,  10,  18,  34,  60,  66,  92,  108,  116,  130,  156,  172,  180,  204,  212, 

228,  258,  284,  300,  308,  332,  340,  356,  396,  404,  420,  452,  514,  540,  556,  ... 



This sequence is entry A079471 in [312]. The values as bit patterns are shown in figure 1.26-D. The crucial observation is that a word is a fixed point if it equals zero or its bit-count equals $2^j$ where j is the index of the lowest set bit.

Now we can find out whether x is a fixed point of the sequence by the following function:

static inline bool is_lexrev_fixed_point(ulong x)
// Return whether x is a fixed point in the prev_lexrev() - sequence
{
    if ( x & 1 )
    {
        if ( 1==x )  return true;
        else  return false;
    }
    else
    {
        ulong w = bit_count(x);
        if ( w != (w & -w) )  return false;
        if ( 0==x )  return true;
        return  0 != ( (x & -x) & w );
    }
}

Alternatively,  use  either  of  the  following  tests: 


x == negidx2lexrev(x)
x == lexrev2negidx(x)
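As a cross-check of the criterion (a word is a fixed point if it is zero or its bit-count equals 2^j with j the index of the lowest set bit), the following sketch lists the fixed points below a limit. The helper names are hypothetical, and a portable loop replaces the library's bit_count().

```cpp
#include <cstdint>
#include <vector>

// Portable bit count (a stand-in for bit_count()).
static unsigned count_bits(std::uint64_t x)
{
    unsigned c = 0;
    for ( ; x != 0; x &= x - 1 )  ++c;   // clear lowest set bit
    return c;
}

// x is a fixed point if x == 0 or bit-count(x) == 2^j,
// where j is the index of the lowest set bit.
static bool is_fixed_point(std::uint64_t x)
{
    if ( x == 0 )  return true;
    const std::uint64_t lowest = x & (~x + 1);   // == 2^j
    return count_bits(x) == lowest;
}

// Collect all fixed points below limit, in increasing order.
static std::vector<std::uint64_t> fixed_points_below(std::uint64_t limit)
{
    std::vector<std::uint64_t> v;
    for (std::uint64_t x = 0; x < limit; ++x)
        if ( is_fixed_point(x) )  v.push_back(x);
    return v;
}
```

Calling `fixed_points_below(100)` yields 0, 1, 6, 10, 18, 34, 60, 66, 92, the start of the sequence given above.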




1.26.5 Recursive generation and relation to a power series ‡


Start: 1
Rules:
  0 --> 0
  1 --> 110

0:   (#=2)
  1
1:   (#=4)
  110
2:   (#=8)
  1101100
3:   (#=16)
  110110011011000
4:   (#=32)
  1101100110110001101100110110000
5:   (#=64)
  110110011011000110110011011000011011001101100011011001101100000

Figure 1.26-E: String substitution with rules {0 ↦ 0, 1 ↦ 110}.


The  following  function  generates  the  bit-reversed  binary  words  in  reversed  lexicographic  order: 


void C(ulong f, ulong n, ulong w)
{
    for (ulong m=1; m<n; m<<=1)  C(f+m, m, w^m);
    print_bin("  ", w, 10);  // visit
}

By calling C(0, 64, 0) we generate the list of words shown in figure 1.26-B with the all-zeros word moved to the last position. A slight modification of the function

void A(ulong f, ulong n)
{
    cout << "1,";
    for (ulong m=1; m<n; m<<=1)  A(f+m, m);
    cout << "0,";
}

generates the power series (sequence A079559 in [312])

\[ \prod_{n=1}^{\infty} \left(1 + x^{2^n-1}\right) = 1 + x + x^3 + x^4 + x^7 + x^8 + x^{10} + x^{11} + x^{15} + x^{16} + \ldots \qquad (1.26\text{-}3) \]


By  calling  A(0,  32)  we  generate  the  sequence 


1,1, 0,1, 1,0, 0,1, 1,0, 1,1, 0,0, 0,1, 1,0, 1,1, 0,0, 1,1, 0,1, 1,0, 0,0,0, 


Indeed, the lowest bit of the k-th word of the bit-reversed sequence in reversed lexicographic order equals the (k−1)-st coefficient in the power series. The sequence can also be generated by the string substitution shown in figure 1.26-E.
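The agreement between the recursion and the substitution rules 0 → 0, 1 → 110 can be checked on small cases. In this sketch the recursion appends to a string instead of printing, and the name `substitute` is hypothetical.

```cpp
#include <string>

// Append the output of the recursion A(f, n) as characters to out.
// (f is carried along as in the original; it does not affect the output.)
static void A(unsigned long f, unsigned long n, std::string &out)
{
    out += '1';
    for (unsigned long m = 1; m < n; m <<= 1)  A(f + m, m, out);
    out += '0';
}

// Apply the substitution 0 -> 0, 1 -> 110 to the start word "1", steps times.
static std::string substitute(unsigned steps)
{
    std::string w = "1";
    for (unsigned k = 0; k < steps; ++k)
    {
        std::string next;
        for (char c : w)  next += (c == '1' ? "110" : "0");
        w = next;
    }
    return w;
}
```

With n = 8 the recursion yields `1101100110110000`, which is the word after three substitution steps followed by one more zero.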

The sequence of sums, prepended by 1,

\[ 1 + x\, \frac{\prod_{n=1}^{\infty} \left(1 + x^{2^n-1}\right)}{1 - x} = 1 + 1x + 2x^2 + 2x^3 + 3x^4 + 4x^5 + 4x^6 + \ldots \qquad (1.26\text{-}4) \]


has  series  coefficients 


1, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 8, 8, 9, 10, 10, 11, 12, 12, 12, 13, ...

This sequence is entry A046699 in [312]. We have a(1) = a(2) = 1 and the sequence satisfies the peculiar recurrence

\[ a(n) = a(n - a(n-1)) + a(n - 1 - a(n-2)) \quad \text{for } n > 2 \qquad (1.26\text{-}5) \]
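A small sketch can confirm that the recurrence reproduces the series coefficients. It computes the truncated product, forms the sums prepended by 1, and evaluates the recurrence; all function names are hypothetical. The comparison a(n) = b(n−1) reflects the index shift between the 1-based recurrence and the 0-based series.

```cpp
#include <cstddef>
#include <vector>

// Coefficients c[0..len-1] of prod_{n>=1} (1 + x^(2^n - 1)), truncated.
static std::vector<unsigned> product_coefficients(std::size_t len)
{
    std::vector<unsigned> c(len, 0);
    if ( len == 0 )  return c;
    c[0] = 1;
    for (std::size_t e = 1; e < len; e = 2*e + 1)   // e = 2^n - 1
        for (std::size_t k = len; k-- > e; )         // multiply by (1 + x^e)
            c[k] += c[k - e];
    return c;
}

// b[0] = 1 and b[j] = c[0] + ... + c[j-1]: the series 1 + x*P(x)/(1-x).
static std::vector<unsigned> sums_prepended_by_one(std::size_t len)
{
    const std::vector<unsigned> c = product_coefficients(len);
    std::vector<unsigned> b(len, 0);
    if ( len == 0 )  return b;
    b[0] = 1;
    unsigned sum = 0;
    for (std::size_t j = 1; j < len; ++j)  { sum += c[j-1];  b[j] = sum; }
    return b;
}

// a(1) = a(2) = 1, a(n) = a(n - a(n-1)) + a(n - 1 - a(n-2)) for n > 2.
// The returned vector is 1-based: a[0] is unused.
static std::vector<unsigned> recurrence(std::size_t nmax)
{
    std::vector<unsigned> a(nmax + 1, 0);
    for (std::size_t n = 1; n <= nmax; ++n)
        a[n] = ( n <= 2 ? 1 : a[n - a[n-1]] + a[n - 1 - a[n-2]] );
    return a;
}
```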

1.27 Fibonacci words ‡

A Fibonacci word is a word that does not contain two successive ones. Whether a given binary word is a Fibonacci word can be tested with the function [FXT: bits/fibrep.h]:

static inline bool is_fibrep(ulong f)
{
    return ( 0==(f&(f>>1)) );
}


The  following  functions  convert  between  the  binary  and  the  Fibonacci  representation: 


static inline ulong bin2fibrep(ulong b)
// Return Fibonacci representation of b
// Limitation: the first Fibonacci number greater
//  than b must be representable as ulong.
// 32 bit:  b < 2971215073=F(47)  [F(48)=4807526976 > 2^32]
// 64 bit:  b < 12200160415121876738=F(93)  [F(94) > 2^64]
{
    ulong f0=1, f1=1, s=1;
    while ( f1<=b )  { ulong t = f0+f1;  f0=f1;  f1=t;  s<<=1; }
    ulong f = 0;
    while ( b )
    {
        s >>= 1;
        if ( b>=f0 )  { b -= f0;  f^=s; }
        { ulong t = f1-f0;  f1=f0;  f0=t; }
    }
    return f;
}

static inline ulong fibrep2bin(ulong f)
// Return binary representation of f
// Inverse of bin2fibrep().
{
    ulong f0=1, f1=1;
    ulong b = 0;
    while ( f )
    {
        if ( f&1 )  b += f1;
        { ulong t=f0+f1;  f0=f1;  f1=t; }
        f >>= 1;
    }
    return b;
}


1.27.1 Lexicographic order

Figure 1.27-A: All 55 Fibonacci words with 8 bits in lexicographic order.

The 8-bit Fibonacci words are shown in figure 1.27-A. To generate all Fibonacci words in lexicographic order, use the function [FXT: bits/fibrep.h]:

static inline ulong next_fibrep(ulong x)
// With x the Fibonacci representation of n
// return Fibonacci representation of n+1.
{
    // 2 examples:          //  ex. 1             //  ex. 2
    //                      // x == [*]0 010101   // x == [*]0 01010
    ulong y = x | (x>>1);   // y == [*]? 011111   // y == [*]? 01111
    ulong z = y + 1;        // z == [*]? 100000   // z == [*]? 10000
    z = z & -z;             // z == [0]0 100000   // z == [0]0 10000
    x ^= z;                 // x == [*]0 110101   // x == [*]0 11010
    x &= ~(z-1);            // x == [*]0 100000   // x == [*]0 10000

    return x;
}

The routine can be used to generate all n-bit words as shown in [FXT: bits/fibrep2-demo.cc]:

const ulong f = 1UL << n;
ulong t = 0;
do
{
    // visit(t)
    t = next_fibrep(t);
}
while ( t!=f );

The reversed order can be generated via

ulong f = 1UL << n;
do
{
    f = prev_fibrep(f);
    // visit(f)
}
while ( f );

which  uses  the  function  (64-bit  version) 

static inline ulong prev_fibrep(ulong x)
// With x the Fibonacci representation of n
// return Fibonacci representation of n-1.
{
    // 2 examples:                    //  ex. 1             //  ex. 2
    //                                // x == [*]0 100000   // x == [*]0 10000
    ulong y = x & -x;                 // y == [0]0 100000   // y == [0]0 10000
    x ^= y;                           // x == [*]0 000000   // x == [*]0 00000
    ulong m = 0x5555555555555555UL;   // m == ...01010101
    if ( m & y )  m >>= 1;            // m == ...01010101   // m == ...0101010
    m &= (y-1);                       // m == [0]0 010101   // m == [0]0 01010
    x ^= m;                           // x == [*]0 010101   // x == [*]0 01010
    return x;
}

The  forward  version  generates  about  180  million  words  per  second,  the  backward  version  about  170 
million  words  per  second. 
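The two stepping routines can be exercised together. The sketch below restates them with fixed 64-bit words, counts the n-bit Fibonacci words by stepping forward from zero, and checks that the backward step undoes the forward step. The type `u64` and the helpers `count_forward` and `round_trip` are hypothetical.

```cpp
#include <cstdint>

typedef std::uint64_t u64;  // fixed-width stand-in for ulong (assumption)

static inline bool is_fibrep(u64 f)  { return 0 == (f & (f >> 1)); }

static inline u64 next_fibrep(u64 x)
{
    u64 y = x | (x >> 1);
    u64 z = y + 1;
    z &= ~z + 1;        // isolate lowest set bit of z
    x ^= z;             // set that bit in x
    x &= ~(z - 1);      // clear all bits below it
    return x;
}

static inline u64 prev_fibrep(u64 x)
{
    u64 y = x & (~x + 1);            // lowest set bit
    x ^= y;                          // clear it
    u64 m = 0x5555555555555555ULL;   // ...01010101
    if ( m & y )  m >>= 1;           // pattern must have the bit just below y set
    m &= (y - 1);                    // keep only bits below y
    x ^= m;
    return x;
}

// Count the n-bit Fibonacci words by stepping forward from 0 (1 <= n <= 62).
static unsigned long count_forward(unsigned n)
{
    const u64 f = u64(1) << n;
    unsigned long ct = 0;
    u64 t = 0;
    do { ++ct;  t = next_fibrep(t); }  while ( t != f );
    return ct;
}

// Check that prev_fibrep() undoes next_fibrep() for all n-bit words.
static bool round_trip(unsigned n)
{
    const u64 f = u64(1) << n;
    for (u64 t = 0; t != f; )
    {
        const u64 s = next_fibrep(t);
        if ( !is_fibrep(s) || prev_fibrep(s) != t )  return false;
        t = s;
    }
    return true;
}
```

For n = 8 the forward count is 55, the number of words in figure 1.27-A.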


1.27.2 Gray code order ‡

A Gray code for the binary Fibonacci words (shown in figure 1.27-B) can be derived from the Gray code of the radix −2 representations (see section 1.22 on page 58) of binary words whose difference is of the form


      3       1
              1.
              1..
     19       1..1
     37       1..1.
     73       1..1..
    147       1..1..1
    293       1..1..1.

The algorithm is to try these values as increments starting from the least, same as for the minimal-change combination described in section 1.24.4 on page 66. The next valid word is encountered if it is a valid Fibonacci word, that is, if it does not contain two consecutive set bits. The implementation is [FXT: class bit_fibgray in bits/bitfibgray.h]:

class bit_fibgray
// Fibonacci Gray code with binary words.
{
public:
    ulong x_;  // current Fibonacci word
    ulong k_;  // aux
    ulong fw_, lw_;  // first and last Fibonacci word in Gray code
    ulong mw_;  // max(fw_, lw_)
    ulong n_;  // Number of bits

public:
    bit_fibgray(ulong n)
    {
        n_ = n;
        fw_ = 0;
        for (ulong m=(1UL<<(n-1));  m!=0;  m>>=3)  fw_ |= m;
        lw_ = fw_ >> 1;
        if ( 0==(n&1) )  { ulong t=fw_; fw_=lw_; lw_=t; }  // swap first/last
        mw_ = ( lw_>fw_ ? lw_ : fw_ );
        x_ = fw_;
        k_ = inverse_gray_code(fw_);
        k_ = neg2bin(k_);
    }

    ~bit_fibgray()  {;}

    ulong next()
    // Return next word in Gray code.
    // Return ~0 if current word is the last one.
    {
        if ( x_ == lw_ )  return ~0UL;
        ulong s = n_;  // shift
        while ( 1 )
        {
            --s;
            ulong c = 1 | (mw_ >> s);  // possible difference for negbin word
            ulong i = k_ - c;
            ulong x = bin2neg(i);
            x ^= (x>>1);

            if ( 0==(x&(x>>1)) )  // is Fibonacci word?
            {
                k_ = i;
                x_ = x;
                return x;
            }
        }
    }
};

[Figure columns: j, k(j), k(j)-k(j-1), x=bin2neg(k), ...]

Figure 1.27-B: Gray code for the binary Fibonacci words (rightmost column).

About 130 million words per second are generated. The program [FXT: bits/bitfibgray-demo.cc] shows how to use the class, figure 1.27-B was created with it. Section 14.2 on page 305 gives a recursive algorithm for Fibonacci words in Gray code order.

1.28 Binary words and parentheses strings ‡


  0:  ....  P  [empty string]     .....  [empty string]
  1:  ...1  P  ()                 ....1  ()
  2:  ..1.                        ...11  (())
  3:  ..11  P  (())               ..1.1  ()()
  4:  .1..                        ..111  ((()))
  5:  .1.1  P  ()()               .1.11  (()())
  6:  .11.                        .11.1  ()(())
  7:  .111  P  ((()))             .1111  (((())))
  8:  1...                        1..11  (())()
  9:  1..1                        1.1.1  ()()()
 10:  1.1.                        1.111  ((()()))
 11:  1.11  P  (()())             11.11  (()(()))
 12:  11..                        111.1  ()((()))
 13:  11.1  P  ()(())             11111  ((((()))))
 14:  111.
 15:  1111  P  (((())))

Figure 1.28-A: Left: some of the 4-bit binary words can be interpreted as a string of parentheses (marked with ‘P’). Right: all 5-bit words that correspond to well-formed parentheses strings.


A subset of the binary words can be interpreted as a (well-formed) string of parentheses. The 4-bit binary words that have this property are marked with a ‘P’ in figure 1.28-A (left) [FXT: bits/parenword-demo.cc]. The strings are constructed by scanning the word from the low end and printing a ‘(’ with each one and a ‘)’ with each zero. To find out when to terminate, one adds up +1 for each opening parenthesis and −1 for a closing parenthesis. After the ones in the binary word have been scanned, s closing parentheses have to be added where s is the value of the sum [FXT: bits/parenwords.h]:

static inline void parenword2str(ulong x, char *str)
{
    int s = 0;
    ulong j = 0;
    for (j=0; x!=0; ++j)
    {
        s += ( x&1 ? +1 : -1 );
        str[j] = ")("[x&1];
        x >>= 1;
    }
    while ( s-- > 0 )  str[j++] = ')';  // finish string
    str[j] = 0;  // terminate string
}


The  5-bit  binary  words  that  are  valid  ‘paren  words’  together  with  the  corresponding  strings  are  shown 
in  figure  1.28-A  (right).  Note  that  the  lower  bits  in  the  word  (right  end)  correspond  to  the  beginning  of 
the  string  (left  end).  If  a negative  value  for  the  sums  occurs  at  any  time  of  the  computation,  the  word  is 
not  a paren  word.  A function  to  determine  whether  a word  is  a paren  word  is 


static inline bool is_parenword(ulong x)
{
    int s = 0;
    for (ulong j=0; x!=0; ++j)
    {
        s += ( x&1 ? +1 : -1 );
        if ( s<0 )  break;  // invalid word
        x >>= 1;
    }
    return (s>=0);
}
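The conversion and the validity test can be tried out with a string-based variant. This is a sketch only: it uses `std::string` instead of a caller-supplied buffer, and the names `parenword_to_string` and `parenwords_below` are hypothetical.

```cpp
#include <string>
#include <vector>

// Convert a paren word to its string: scan from the low end,
// '(' for a one, ')' for a zero, then close what is still open.
static std::string parenword_to_string(unsigned long x)
{
    std::string str;
    int s = 0;
    for ( ; x != 0; x >>= 1)
    {
        s += ( x & 1 ? +1 : -1 );
        str += ( x & 1 ? '(' : ')' );
    }
    while ( s-- > 0 )  str += ')';
    return str;
}

// A word is a paren word if the running sum never becomes negative.
static bool is_parenword(unsigned long x)
{
    int s = 0;
    for ( ; x != 0; x >>= 1)
    {
        s += ( x & 1 ? +1 : -1 );
        if ( s < 0 )  return false;
    }
    return true;
}

// All nonzero paren words below limit, in increasing order.
static std::vector<unsigned long> parenwords_below(unsigned long limit)
{
    std::vector<unsigned long> v;
    for (unsigned long x = 1; x < limit; ++x)
        if ( is_parenword(x) )  v.push_back(x);
    return v;
}
```

For instance, the word 11 (binary 1.11) gives the string `(()())`, and the word 19 (binary 1..11) gives `(())()`.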

The  sequence 


1,  3,  5,  7,  11,  13,  15,  19,  21,  23,  27,  29,  31,  39,  43,  45,  47,  51,  53,  55,  59,  61,  63, 


of nonzero integers x so that is_parenword(x) returns true is entry A036991 in [312]. If we fix the number of paren pairs, then the following functions generate the least and biggest valid paren words. The first paren word is a block of n ones at the low end:

static inline ulong first_parenword(ulong n)
// Return least binary word corresponding to n pairs of parens
// Example, n=5:  .....11111  ((((()))))
{
    return first_comb(n);
}

The last paren word is the word with a sequence of n blocks ‘01’ at the low end:

static inline ulong last_parenword(ulong n)
// Return biggest binary word corresponding to n pairs of parens.
// Must have: 1 <= n <= BITS_PER_LONG/2.
// Example, n=5:  .1.1.1.1.1  ()()()()()
{
    return 0x5555555555555555UL >> (BITS_PER_LONG-2*n);
}


 ......11111 = ((((()))))     ...1...1111 = (((()))())     ..1....1111 = (((())))()
 .....1.1111 = (((()())))     ...1..1.111 = ((()())())     ..1...1.111 = ((()()))()
 .....11.111 = ((()(())))     ...1..11.11 = (()(())())     ..1...11.11 = (()(()))()
 .....111.11 = (()((())))     ...1..111.1 = ()((())())     ..1...111.1 = ()((()))()
 .....1111.1 = ()(((())))     ...1.1..111 = ((())()())     ..1..1..111 = ((())())()
 ....1..1111 = (((())()))     ...1.1.1.11 = (()()()())     ..1..1.1.11 = (()()())()
 ....1.1.111 = ((()()()))     ...1.1.11.1 = ()(()()())     ..1..1.11.1 = ()(()())()
 ....1.11.11 = (()(()()))     ...1.11..11 = (())(()())     ..1..11..11 = (())(())()
 ....1.111.1 = ()((()()))     ...1.11.1.1 = ()()(()())     ..1..11.1.1 = ()()(())()
 ....11..111 = ((())(()))     ...11...111 = ((()))(())     ..1.1...111 = ((()))()()
 ....11.1.11 = (()()(()))     ...11..1.11 = (()())(())     ..1.1..1.11 = (()())()()
 ....11.11.1 = ()(()(()))     ...11..11.1 = ()(())(())     ..1.1..11.1 = ()(())()()
 ....111..11 = (())((()))     ...11.1..11 = (())()(())     ..1.1.1..11 = (())()()()
 ....111.1.1 = ()()((()))     ...11.1.1.1 = ()()()(())     ..1.1.1.1.1 = ()()()()()

Figure 1.28-B: The 42 binary words corresponding to all valid pairings of 5 parentheses, in colex order (read column by column).


The  sequence  of  all  binary  words  corresponding  to  n pairs  of  parens  in  colex  order  can  be  generated  with 
the  following  (slightly  cryptic)  function: 

static inline ulong next_parenword(ulong x)
// Next (colex order) binary word that is a paren word.
{
    if ( x & 2 )  // Easy case, move highest bit of lowest block
    {
        ulong b = lowest_zero(x);
        x ^= b;
        x ^= (b>>1);
        return x;
    }
    else  // Gather all low "01"s and split lowest nontrivial block
    {
        if ( 0==(x & (x>>1)) )  return 0;
        ulong w = 0;  // word where the bits are assembled
        ulong s = 0;  // shift for lowest block
        ulong i = 1;  // == lowest_one(x)
        do  // collect low "01"s:
        {
            x ^= i;
            w <<= 1;
            w |= 1;
            ++s;
            i <<= 2;  // == lowest_one(x);
        }
        while ( 0==(x&(i<<1)) );

        ulong z = x ^ (x+i);  // lowest block
        x ^= z;
        z &= (z>>1);
        z &= (z>>1);
        w ^= (z>>s);
        x |= w;
        return x;
    }
}

The program [FXT: bits/parenword-colex-demo.cc] shows how to create a list of binary words corresponding to n pairs of parens (code slightly shortened):

ulong n = 4;  // Number of paren pairs
ulong pn = 2*n+1;
char *str = new char[pn+1];  str[pn] = 0;  // room for 2*n parens and the terminator
ulong x = first_parenword(n);
while ( x )
{
    print_bin("  ", x, pn);
    parenword2str(x, str);
    cout << " = " << str << endl;
    x = next_parenword(x);
}

Its output with n = 5 is shown in figure 1.28-B. The 1,767,263,190 paren words for n = 19 are generated at a rate of about 169 million words per second. Chapter 15 on page 323 gives a different formulation of the algorithm.


Knuth [215, ex. 23, sect. 7.1.3] gives a very elegant routine for generating the next paren word, the comments are MMIX instructions:

static inline ulong next_parenword(ulong x)
{
    const ulong m0 = -1UL/3;
    ulong t = x ^ m0;                 // XOR t, x, m0;
    if ( (t&x)==0 )  return 0;        // current is last
    ulong u = (t-1) ^ t;              // SUBU u, t, 1;  XOR u, t, u;
    ulong v = x | u;                  // OR v, x, u;
    ulong y = bit_count( u & m0 );    // SADD y, u, m0;
    ulong w = v + 1;                  // ADDU w, v, 1;
    t = v & ~w;                       // ANDN t, v, w;
    y = t >> y;                       // SRU y, t, y;
    y += w;                           // ADDU y, w, y;
    return y;
}

The routine is slower, however: about 81 million words per second are generated. A bit-count instruction in hardware would speed it up significantly. Treating the case of easy update separately as in the other version, we get a rate of about 137 million words per second.
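To see the short routine at work, the following sketch counts the paren words for a given number of pairs by stepping from the first word (n ones) until zero is returned. It assumes 64-bit words, uses a portable bit count instead of a hardware instruction, and the names `u64`, `count_bits` and `count_parenwords` are hypothetical.

```cpp
#include <cstdint>

typedef std::uint64_t u64;  // fixed-width stand-in for ulong (assumption)

// Portable bit count (a stand-in for bit_count()).
static unsigned count_bits(u64 x)
{
    unsigned c = 0;
    for ( ; x != 0; x &= x - 1 )  ++c;
    return c;
}

// Next paren word in colex order, or 0 if x is the last one.
static u64 next_parenword(u64 x)
{
    const u64 m0 = ~u64(0) / 3;     // 0x5555...5555
    u64 t = x ^ m0;
    if ( (t & x) == 0 )  return 0;  // current is last
    const u64 u = (t - 1) ^ t;
    const u64 v = x | u;
    const unsigned y = count_bits(u & m0);
    const u64 w = v + 1;
    t = v & ~w;
    return (t >> y) + w;
}

// Number of paren words with n pairs (sketch: 1 <= n <= 31).
static unsigned long count_parenwords(unsigned n)
{
    unsigned long ct = 0;
    for (u64 x = (u64(1) << n) - 1;  x != 0;  x = next_parenword(x))  ++ct;
    return ct;
}
```

The counts are the Catalan numbers: 5 words for n = 3, 14 for n = 4, and the 42 words of figure 1.28-B for n = 5.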


1.29 Permutations via primitives ‡

We  give  two  methods  to  specify  permutations  of  the  bits  of  a binary  word  via  one  or  more  control  words. 
The  methods  are  suggestions  for  machine  instructions  that  can  serve  as  primitives  for  permutations  of 
the  bits  of  a word. 

1.29.1  A restricted  method 


 ................1111111111111111
 ........11111111........11111111
 ....1111....1111....1111....1111
 ..11..11..11..11..11..11..11..11
 .1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1

 ................1...............   bits 15 ...
 ........1...............1.......   bits 7 ...
 ....1.......1.......1.......1...   bits 3 11 ...
 ..1...1...1...1...1...1...1...1.   bits 1 5 9 13 ...
 .1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1   bits 0 2 4 6 8 10 12 14 ...

Figure 1.29-A: Mask with primitives for permuting bits with 32-bit words (top), and words with ones at the highest bit of each block (bottom).


We can specify a subset of all permutations by selecting bit-blocks of the masks as shown for 32-bit words in figure 1.29-A (top). Subsets of the blocks of the masks can be determined with the bits of a word by considering the highest bit of each block (bottom of the figure). We use all bits of a word (except for the highest bit) to select the blocks where the bits defined by the block and those left to it should be swapped. An implementation of the implied algorithm is given in [FXT: bits/bitperm1-demo.cc]. Arrays are used to give more readable code:


void perm1(uchar *a, ulong ldn, const uchar *x)
// Permute a[] according to the 'control word' x[].
// The length of a[] must be 2**ldn.
{
    long n = 1L<<ldn;
    for (long s=n/2; s>0; s/=2)
        for (long k=0; k<n; k+=s+s)
        {
            if ( x[k+s-1]!='0' )
            {
                // swap regions [a+k,...,a+k+s-1], [a+k+s,...,a+k+2*s-1]:
                swap(a+k, a+k+s, s);
            }
        }
}


The  routine  for  the  inverse  permutation  differs  in  a single  line: 
    for (long s=1; s<n; s+=s)


No  attempt  has  been  made  to  optimize  or  parallelize  the  algorithm.  We  just  explore  how  useful  a machine 
instruction  for  the  permutation  of  bits  would  be. 


The  program  uses  a fixed  size  of  16  bits,  an  ‘x’  is  printed  whenever  the  corresponding  bit  is  set: 


 a=0123456789ABCDEF    bits of the input word
 x=0010011000110110    control word
           7
       3       11x
     1   5x  9   13x
    0 2x 4 6x 8 10x 12 14x
 a=01326754CDFEAB98    result


This control word leads to the Gray permutation (see section 2.12 on page 128). Assume we use words with N bits. We cannot (for N > 2) specify all N! permutations as we can choose between only $2^{N-1}$ control words. Now set the word length to $N := 2^n$. The reachable permutations are those where the intervals $[k \cdot 2^j, \ldots, (k+1) \cdot 2^j - 1]$ contain all numbers $[p \cdot 2^j, \ldots, (p+1) \cdot 2^j - 1]$ for all $j \le n$ and $0 \le k < 2^{n-j}$, choosing p for each interval arbitrarily ($0 \le p < 2^{n-j}$). For example, the lower half of the permuted array must contain a permutation of either the lower or the upper half (j = n − 1) and each pair $a_{2y}, a_{2y+1}$ must contain two elements 2z, 2z + 1 (j = 1). The bit-reversal is computed with a control word where all bits are set. Alas, the (important!) zip permutation (bit-zip, see section 1.15 on page 38) is unreachable.

A machine  instruction  could  choose  between  the  two  routines  via  the  highest  bit  in  the  control  word. 
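The restricted method can be tried out on symbolic arrays. The sketch below works on `std::string` instead of raw arrays, and the names `swap_level`, `perm1` (in this string form) and `perm1_inverse` are assumptions made for illustration, not the FXT interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// One level: swap the halves of each length-2s block whose control
// character x[k+s-1] is not '0'.
static void swap_level(std::string &a, const std::string &x, std::size_t s)
{
    for (std::size_t k = 0; k + 2*s <= a.size(); k += 2*s)
        if ( x[k+s-1] != '0' )
            std::swap_ranges(a.begin() + k, a.begin() + k + s, a.begin() + k + s);
}

// Forward permutation: levels from half the array length down to 1.
// a and x must have the same length, a power of 2.
static std::string perm1(std::string a, const std::string &x)
{
    for (std::size_t s = a.size()/2; s > 0; s /= 2)  swap_level(a, x, s);
    return a;
}

// Inverse permutation: the same levels in the opposite order.
static std::string perm1_inverse(std::string a, const std::string &x)
{
    for (std::size_t s = 1; s < a.size(); s += s)  swap_level(a, x, s);
    return a;
}
```

With the input `0123456789ABCDEF` and the control word `0010011000110110` the result is `01326754CDFEAB98`; a control word of all ones reverses the array.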


1.29.2  A general  method 


All  permutations  of  N = 2n  elements  can  be  specified  with  n control  words  of  N bits.  Assume  we  have 
a machine  instruction  that  collects  bits  according  to  a control  word.  An  eight  bit  example: 


 a = abcdefgh    input data
 x = ..1.11.1    control word (dots for zeros)
       cefh      bits of a where x has a one
     abdg        bits of a where x has a zero
     abdgcefh    result, bits separated according to x


We need n such instructions that work on all length-$2^k$ sub-words for $1 \le k \le n$. For example, the instruction working on half words of a 16-bit word would work as

 a = abcdefgh ABCDEFGH    input data
 x = ..1.11.1 1111....    control word (dots for zeros)
       cefh ABCD          bits of a where x has a one
     abdg       EFGH      bits of a where x has a zero
     abdgcefh EFGHABCD    result, bits separated according to x

Note the bits of the different sub-words are not mixed. Now all permutations can be reached if the control word for the $2^k$-bit sub-words has exactly $2^{k-1}$ bits set in all ranges $[j \cdot 2^k, \ldots, (j+1) \cdot 2^k - 1]$.

A control word together with the specification of the instruction used defines the action taken. The following leads to a swap of adjacent bit pairs

 1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.   k= 1 (2-bit sub-words)

while this

 1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.   k= 5 (32-bit sub-words)

results in gathering the even and odd indexed bits in the halfwords.

A complete  set  of  permutation  primitives  for  16-bit  words  and  their  effect  on  a symbolic  array  of  bits 
(split  into  groups  of  four  elements  for  readability)  is 


                               0123 4567 89ab cdef
 11111111........  k= 4  ==>   89ab cdef 0123 4567
 1111....1111....  k= 3  ==>   cdef 89ab 4567 0123
 11..11..11..11..  k= 2  ==>   efcd ab89 6745 2301
 1.1.1.1.1.1.1.1.  k= 1  ==>   fedc ba98 7654 3210

The  top  primitive  leads  to  a swap  of  the  left  and  right  half  of  the  bits,  the  next  to  a swap  of  the  halves  of 
the  half  words  and  so  on.  The  computed  permutation  is  array  reversal.  Note  that  we  use  array  notation 
(least  index  left)  here. 

The  resulting  permutation  depends  on  the  order  in  which  the  primitives  are  used.  When  starting  with 
full  words  we  get: 


                               0123 4567 89ab cdef
 1.1.1.1.1.1.1.1.  k= 4  ==>   1357 9bdf 0246 8ace
 1.1.1.1.1.1.1.1.  k= 3  ==>   37bf 159d 26ae 048c
 1.1.1.1.1.1.1.1.  k= 2  ==>   7f3b 5d19 6e2a 4c08
 1.1.1.1.1.1.1.1.  k= 1  ==>   f7b3 d591 e6a2 c480

The result is different when starting with 2-bit sub-words:

                               0123 4567 89ab cdef
 1.1.1.1.1.1.1.1.  k= 1  ==>   1032 5476 98ba dcfe
 1.1.1.1.1.1.1.1.  k= 2  ==>   0213 4657 8a9b cedf
 1.1.1.1.1.1.1.1.  k= 3  ==>   2367 0145 abef 89cd
 1.1.1.1.1.1.1.1.  k= 4  ==>   3715 bf9d 2604 ae8c
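The effect of the primitives on symbolic arrays can be simulated with strings. The sketch below is a model of the proposed instruction, not existing hardware or FXT code. The names `separate` and `apply_all` are hypothetical, array notation (least index left) is used, and the control word is assumed to be at least as long as the data.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>

// Within every sub-word of the given length, move the elements where the
// control word x has a zero ('.') to the front and those where it has a
// one ('1') to the back, keeping the relative order in both groups.
static std::string separate(const std::string &a, const std::string &x, std::size_t len)
{
    std::string r;
    for (std::size_t start = 0; start < a.size(); start += len)
    {
        std::string zeros, ones;
        for (std::size_t i = start; i < start + len && i < a.size(); ++i)
            ( x[i] == '1' ? ones : zeros ) += a[i];
        r += zeros + ones;
    }
    return r;
}

// Apply the same control word with the given sub-word lengths in turn.
static std::string apply_all(std::string a, const std::string &x,
                             std::initializer_list<std::size_t> lens)
{
    for (std::size_t len : lens)  a = separate(a, x, len);
    return a;
}
```

Applying the control word `1.1.1.1.1.1.1.1.` with sub-word lengths 16, 8, 4, 2 to `0123456789abcdef` gives `f7b3d591e6a2c480`, while the order 2, 4, 8, 16 gives `3715bf9d2604ae8c`.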

There are $\binom{2z}{z}$ possibilities to have z bits set in a 2z-bit word. There are $2^{n-k}$ length-$2^k$ sub-words in a $2^n$-bit word so the number of valid control words for that step is

\[ \binom{2^k}{2^{k-1}}^{2^{n-k}} \]

The product of the number of valid words in all steps gives the number of permutations:

\[ (2^n)! = \prod_{k=1}^{n} \binom{2^k}{2^{k-1}}^{2^{n-k}} \qquad (1.29\text{-}1) \]
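The counting argument can be checked numerically for small n: the product over all steps of the number of valid control words should equal the factorial of 2^n. The following sketch uses 64-bit integers, which suffices up to n = 4 (16! still fits); the function names are hypothetical.

```cpp
#include <cstdint>

typedef std::uint64_t u64;

// Binomial coefficient; the intermediate values stay integral.
static u64 binomial(unsigned n, unsigned k)
{
    u64 b = 1;
    for (unsigned i = 1; i <= k; ++i)  b = b * (n - k + i) / i;
    return b;
}

// prod_{k=1..n} binomial(2^k, 2^(k-1)) ^ (2^(n-k)); fits 64 bits for n <= 4.
static u64 count_by_steps(unsigned n)
{
    u64 p = 1;
    for (unsigned k = 1; k <= n; ++k)
    {
        const u64 c = binomial(1u << k, 1u << (k-1));
        for (unsigned j = 0; j < (1u << (n-k)); ++j)  p *= c;
    }
    return p;
}

static u64 factorial(unsigned m)
{
    u64 f = 1;
    for (unsigned i = 2; i <= m; ++i)  f *= i;
    return f;
}
```

For n = 2 the product is 2 · 2 · 6 = 24 = 4!.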


1.30  CPU  instructions  often  missed 

1.30.1  Essential 

• Bit-shift  and  bit-rotate  instructions  that  work  properly  for  shifts  greater  than  or  equal  to  the  word 
length:  the  shift  instruction  should  zero  the  word,  the  rotate  instruction  should  take  the  shift 
modulo  word  length.  The  C-language  standards  leave  the  results  for  these  operations  undefined 
and  compilers  simply  emit  the  corresponding  assembler  instructions.  The  resulting  CPU  dependent 
behavior  is  both  a source  of  errors  and  makes  certain  optimizations  impossible. 

• A bit-reverse instruction. A fast byte-swap mitigates the problem, see section 1.14 on page 33.

• Instructions that return the index of highest or lowest set bit in a word. They must execute fast.

• Fast conversion from integer to float and double (both directions).

• A fused multiply-add instruction for floats.

• Instructions for the multiplication of complex floating-point numbers, computing A·C − B·D and A·D + B·C from A, B, C, and D.

• A sum-diff instruction, computing A + B and A − B from A and B. This can serve as a primitive for fast orthogonal transforms.

• An instruction to swap registers. Even better, a conditional version of that.


1.30.2  Nice  to  have 

• A parity bit for the complete machine word. The parity of a word is the number of bits modulo 2, not the complement of it. Even better, an instruction for the inverse Gray code, see section 1.16.

• A bit-count instruction, see section 1.8 on page 18. This would also give the parity at bit zero.

• An instruction for computing the index of the i-th set bit of a word, see section 1.10 on page 25. This would be useful even if execution takes a dozen cycles.

• A random number generator, LHCAs (see section 41.8 on page 878) may be candidates. At the very least: a decent entropy source.

• A conditional version of more than just the move instruction, possibly as an instruction prefix.

• A bit-zip and a bit-unzip instruction, see section 1.15 on page 38. Note this is polynomial squaring over GF(2).

• Primitives for permutations of bits, see section 1.29.2. A bit-gather and a bit-scatter instruction for sub-words of all sizes a power of 2 would allow for arbitrary permutations (see [FXT: bits/bitgather.h] and [FXT: bits/bitseparate.h] for versions working on complete words).

• Multiplication corresponding to XOR as addition. This is the multiplication without carries used for polynomials over GF(2), see section 40.1 on page 822.

1.31 Some space filling curves ‡

1.31.1 The Hilbert curve

A rendering of the Hilbert curve (named after David Hilbert [182]) is shown in figure 1.31-A. An efficient algorithm to compute the direction of the n-th move of the Hilbert curve is based on the parity of the number of threes in the radix-4 representation of n (see section 38.9.1 on page 748).

Let dx and dy correspond to the moves at step n in the Hilbert curve. Then dx, dy ∈ {−1, 0, +1} and exactly one of them is zero. So for both p := dx + dy and m := dx − dy we have p, m ∈ {−1, +1}.

The following function computes p and returns 0, 1 if p = −1, +1, respectively [FXT: bits/hilbert.h]:

static inline ulong hilbert_p(ulong t)
// Let dx,dy be the horizontal,vertical move
// with step t of the Hilbert curve.
// Return zero if (dx+dy)==-1, else one (then: (dx+dy)==+1).
// Algorithm: count number of threes in radix 4
{
    ulong d = (t & 0x5555555555555555UL) & ((t & 0xaaaaaaaaaaaaaaaaUL) >> 1);
    return parity( d );
}

If 1 is returned the step is to the right or upwards. The function can be slightly optimized as follows (64-bit version only):

static inline ulong hilbert_p(ulong t)
{
    t &= ((t & 0xaaaaaaaaaaaaaaaaUL) >> 1);
    t ^= t>>2;
    t ^= t>>4;
    t ^= t>>8;
    t ^= t>>16;
    t ^= t>>32;
    return t & 1;
}

Figure 1.31-A: The first 255 segments of the Hilbert curve.

Figure 1.31-B: Moves and turns of the Hilbert curve.

The  corresponding  value  for  m can  be  computed  as: 

static inline ulong hilbert_m(ulong t)
// Let dx,dy be the horizontal,vertical move
// with step t of the Hilbert curve.
// Return zero if (dx-dy)==-1, else one (then: (dx-dy)==+1).
{
    return hilbert_p( -t );
}

If the values for p and m are equal the step is in horizontal direction. It remains to merge the values of p and m into a 2-bit value d that encodes the direction of the move:

static inline ulong hilbert_dir(ulong t)
// Return d encoding the following move with the Hilbert curve.
//
// d \in {0,1,2,3} as follows:
//   d : direction
//   0 : right (+x:  dx=+1, dy= 0)
//   1 : down  (-y:  dx= 0, dy=-1)
//   2 : up    (+y:  dx= 0, dy=+1)
//   3 : left  (-x:  dx=-1, dy= 0)
{
    ulong p = hilbert_p(t);
    ulong m = hilbert_m(t);
    ulong d = p ^ (m<<1);
    return d;
}

To print the value of d symbolically, we can print the value of (">v^<")[d]. The sequence of moves can also be generated by the string substitution process shown in figure 1.31-C.


Start: A
Rules:
  A --> D>A^A<C
  B --> C<BvB>D
  C --> BvC<C^A
  D --> A^D>DvB
  > --> >
  ^ --> ^
  < --> <
  v --> v

0:  (#=1)
  A
1:  (#=7)
  D>A^A<C
2:  (#=31)
  A^D>DvB>D>A^A<C^D>A^A<C<BvC<C^A
3:  (#=127)
  D>A^A<C^A^D>DvB>A^D>DvBvC<BvB>D>A^D>DvB>D>A^A<C^D>A^A<C<BvC<C^A^A^D>DvB>D>A^A<C^D> ...

Figure 1.31-C: Moves of the Hilbert curve by a string substitution process, the symbols 'A', 'B', 'C', and
'D' are ignored when drawing the curve.
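A string substitution process of this kind is easy to program. The following is a minimal sketch, not taken from FXT: the names substitute(), moves_only(), and hilbert_rules are hypothetical, and symbols without a rule are simply copied (which covers the rules that map a move to itself).

```cpp
#include <map>
#include <string>

// Apply the substitution rules `levels` times; symbols without a rule are copied.
std::string substitute(std::string s, const std::map<char, std::string> &rules, unsigned levels)
{
    for (unsigned j = 0; j < levels; ++j)
    {
        std::string next;
        for (char c : s)
        {
            auto it = rules.find(c);
            if (it != rules.end())  next += it->second;
            else                    next += c;
        }
        s = next;
    }
    return s;
}

// Rules for the Hilbert curve (moves map to themselves, so they need no entry).
const std::map<char, std::string> hilbert_rules = {
    {'A', "D>A^A<C"}, {'B', "C<BvB>D"}, {'C', "BvC<C^A"}, {'D', "A^D>DvB"} };

// Drop the auxiliary letters, keep only the moves.
std::string moves_only(const std::string &s)
{
    std::string m;
    for (char c : s)  if (c == '>' || c == '<' || c == '^' || c == 'v')  m += c;
    return m;
}
```

Calling substitute("A", hilbert_rules, 2) gives the 31 symbols of level 2; passing the result through moves_only() leaves the 15 moves to be drawn.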



The  turn  u between  steps  can  be  computed  as 

static inline int hilbert_turn(ulong t)
// Return the turn (left or right) with the steps
// t and t-1 of the Hilbert curve.
// Returned value is
//   0 for no turn
//  +1 for right turn
//  -1 for left turn
{
    ulong d1 = hilbert_dir(t);
    ulong d2 = hilbert_dir(t-1);
    d1 ^= (d1>>1);
    d2 ^= (d2>>1);
    ulong u = d1 - d2;
    // at this point, symbolically:  cout << ("+.-0+.-")[ u + 3 ];
    if ( 0==u )  return 0;
    if ( (long)u<0 )  u += 4;
    return  (1==u ? +1 : -1);
}


To print the value of u symbolically, we can print ("-0+")[u+1].

The values of p and m, followed by the direction and turn of the Hilbert curve, are shown in figure 1.31-B.
The list was created with the program [FXT: bits/hilbert-moves-demo.cc]. Figure 1.31-A was created with
the program [FXT: bits/hilbert-texpic-demo.cc]. The computation of a function whose series coefficients
are ±1 and ±i according to the Hilbert curve is described in section 38.9 on page 747.


A finite  state  machine  (FSM)  for  the  conversion  from  a 1-dimensional  coordinate  (linear  coordinate  of 
the  curve)  to  the  pair  of  coordinates  x and  y of  the  Hilbert  curve  is  described  in  [351  item  115].  At  each 
step  two  bits  of  input  are  processed.  The  array  htab[]  serves  as  lookup  table  for  the  next  state  and  two 
bits of the result. The FSM has an internal state of two bits [FXT: bits/lin2hilbert.cc]:

void
lin2hilbert(ulong t, ulong &x, ulong &y)
// Transform linear coordinate t to Hilbert x and y
{
    ulong xv = 0, yv = 0;
    ulong c01 = (0<<2);  // (2<<2) for transposed output (swapped x, y)
    for (ulong i=0; i<(BITS_PER_LONG/2); ++i)
    {
        ulong abi = t >> (BITS_PER_LONG-2);
        t <<= 2;

        ulong st = htab[ (c01<<2) | abi ];
        c01 = st & 3;

        yv <<= 1;
        yv |= ((st>>2) & 1);
        xv <<= 1;
        xv |= (st>>3);
    }
    x = xv;  y = yv;
}

Figure 1.31-D: The original table from [33] for the finite state machine for the 2-dimensional Hilbert
curve (left). All sixteen 4-bit words appear in both the 'OLD' and the 'NEW' column. So the algorithm is
invertible. Swap the columns and sort numerically to obtain the two columns at the right, the table for
the inverse function.


The  table  used  is  defined  (see  figure  1.31-D)  as 


static const ulong htab[] = {
#define HT(xi,yi,c0,c1) ((xi<<3)+(yi<<2)+(c0<<1)+(c1))
    // index == HT(c0,c1,ai,bi)
    HT( 0, 0,  1, 0 ),
    HT( 0, 1,  0, 0 ),
    HT( 1, 1,  0, 0 ),
    HT( 1, 0,  0, 1 ),
    [--snip--]
    HT( 0, 0,  1, 1 ),
    HT( 0, 1,  1, 0 )
};


As indicated in the code, the table maps every four bits c0,c1,ai,bi to four bits xi,yi,c0,c1. The
table for the inverse function (again, see figure 1.31-D) is


static const ulong ihtab[] = {
#define IHT(ai,bi,c0,c1) ((ai<<3)+(bi<<2)+(c0<<1)+(c1))
    // index == HT(c0,c1,xi,yi)
    IHT( 0, 0,  1, 0 ),
    IHT( 0, 1,  0, 0 ),
    IHT( 1, 1,  0, 1 ),
    IHT( 1, 0,  0, 0 ),
    [--snip--]
    IHT( 0, 1,  1, 1 ),
    IHT( 0, 0,  0, 1 )
};


The  words  have  to  be  processed  backwards: 

ulong
hilbert2lin(ulong x, ulong y)
// Transform Hilbert x and y to linear coordinate t
{
    ulong t = 0;
    ulong c01 = 0;
    for (ulong i=0; i<(BITS_PER_LONG/2); ++i)
    {
        t <<= 2;
        ulong xi = x >> (BITS_PER_LONG/2-1);
        xi &= 1;
        ulong yi = y >> (BITS_PER_LONG/2-1);
        yi &= 1;
        ulong xyi = (xi<<1) | yi;
        x <<= 1;
        y <<= 1;

        ulong st = ihtab[ (c01<<2) | xyi ];
        c01 = st & 3;

        t |= (st>>2);
    }
    return t;
}


1.31.2  The  Z-order 


A 2-dimensional space-filling curve in Z-order traverses all points in each quadrant before it enters the
next. Figure 1.31-E shows a rendering of the Z-order curve, created with the program [FXT: bits/zorder-texpic-demo.cc].
The conversion from a linear parameter to a pair of coordinates is done by separating
the bits at the even and odd indices [FXT: bits/zorder.h]:

static inline void lin2zorder(ulong t, ulong &x, ulong &y)  { bit_unzip2(t, x, y); }

The routine bit_unzip2() is described in section 1.15 on page 38. The inverse is

static inline ulong zorder2lin(ulong x, ulong y)  { return bit_zip2(x, y); }

The next pair can be computed with the following (constant amortized time) routine:

static inline void zorder_next(ulong &x, ulong &y)
{
    ulong b = 1;
    do
    {
        x ^= b;  b &= ~x;
        y ^= b;  b &= ~y;
        b <<= 1;
    }
    while ( b );
}


The  previous  pair  is  computed  similarly: 

static inline void zorder_prev(ulong &x, ulong &y)
{
    ulong b = 1;
    do
    {
        x ^= b;  b &= x;
        y ^= b;  b &= y;
        b <<= 1;
    }
    while ( b );
}
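The stepping routines can be checked against the conversion with a short sketch. It uses a plain loop in place of bit_unzip2() and assumes that x takes the bits at the even indices and y those at the odd indices; the names lin2zorder_loop() and zorder_steps_agree() are hypothetical, and 64-bit words are assumed.

```cpp
#include <cstdint>

typedef std::uint64_t ulong64;  // fixed 64-bit word (assumption of this sketch)

// Separate the bits of t: x takes the bits at even indices, y those at odd indices.
static inline void lin2zorder_loop(ulong64 t, ulong64 &x, ulong64 &y)
{
    x = 0;  y = 0;
    for (unsigned i = 0; i < 32; ++i)
    {
        x |= ((t >> (2*i)) & 1) << i;
        y |= ((t >> (2*i+1)) & 1) << i;
    }
}

static inline void zorder_next(ulong64 &x, ulong64 &y)
{
    ulong64 b = 1;
    do
    {
        x ^= b;  b &= ~x;  // toggle bit of x, keep b only if a carry occurs
        y ^= b;  b &= ~y;
        b <<= 1;
    }
    while ( b );
}

static inline void zorder_prev(ulong64 &x, ulong64 &y)
{
    ulong64 b = 1;
    do
    {
        x ^= b;  b &= x;  // toggle bit of x, keep b only if a borrow occurs
        y ^= b;  b &= y;
        b <<= 1;
    }
    while ( b );
}

// Check that next/prev agree with the conversion for all t < n.
bool zorder_steps_agree(ulong64 n)
{
    for (ulong64 t = 0; t < n; ++t)
    {
        ulong64 x, y, xn, yn;
        lin2zorder_loop(t, x, y);
        lin2zorder_loop(t+1, xn, yn);
        ulong64 a = x, b = y;
        zorder_next(a, b);
        if ( a != xn || b != yn )  return false;
        zorder_prev(a, b);
        if ( a != x || b != y )  return false;
    }
    return true;
}
```

The loop in zorder_next() stops as soon as no carry remains, which is why the routine runs in constant amortized time.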

The  routines  are  written  in  a way  that  generalizes  easily  to  more  dimensions: 
static inline void zorder3d_next(ulong &x, ulong &y, ulong &z)
{
    ulong b = 1;
    do
    {
        x ^= b;  b &= ~x;
        y ^= b;  b &= ~y;
        z ^= b;  b &= ~z;
        b <<= 1;
    }
    while ( b );
}

static inline void zorder3d_prev(ulong &x, ulong &y, ulong &z)
{
    ulong b = 1;
    do
    {
        x ^= b;  b &= x;
        y ^= b;  b &= y;
        z ^= b;  b &= z;
        b <<= 1;
    }
    while ( b );
}

Unlike  with  the  Hilbert  curve  there  are  steps  where  the  curve  advances  more  than  one  unit. 


1.31.3  Curves  via  paper-folding  sequences 

The paper-folding sequence, entry A014577 in [312], starts as [FXT: bits/bit-paper-fold-demo.cc]:

11011001110010011101100011001001110110011100100011011000110010011 ...

The k-th element (k > 0) is one if k = 2^t · (4u + 1), entry A091072 in [312]:

1, 2, 4, 5, 8, 9, 10, 13, 16, 17, 18, 20, 21, 25, 26, 29, 32, 33, ...

The k-th element of the paper-folding sequence can be computed by testing the value of the bit left to
the lowest (that is, rightmost) one in the binary expansion of k [FXT: bits/bit-paper-fold.h]:

static inline bool bit_paper_fold(ulong k)
{
    ulong h = k & -k;  // == lowest_one(k)
    k &= (h<<1);
    return ( k==0 );
}

About  550  million  values  per  second  are  generated.  We  use  bool  as  return  type  to  indicate  that  only 
zero  or  one  is  returned.  The  value  can  be  used  as  an  integer  of  arbitrary  type,  there  is  no  need  for  a cast. 
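The bit test and the characterization k = 2^t · (4u + 1) can be compared directly. The sketch below assumes 64-bit words; the helpers paper_fold_by_odd_part(), paper_fold_string(), and paper_fold_forms_agree() are hypothetical names used only for this check.

```cpp
#include <cstdint>
#include <string>

static inline bool bit_paper_fold(std::uint64_t k)
{
    std::uint64_t h = k & (0 - k);  // lowest set bit of k
    k &= (h << 1);                  // isolate the bit to the left of it
    return (k == 0);
}

// Same value via k = 2^t * (4u+1): the odd part of k must be 1 modulo 4.
// Only valid for k > 0.
static inline bool paper_fold_by_odd_part(std::uint64_t k)
{
    while ((k & 1) == 0)  k >>= 1;  // strip the factor 2^t
    return ((k & 3) == 1);
}

// Elements k = 1, 2, ..., n as a string of zeros and ones.
std::string paper_fold_string(std::uint64_t n)
{
    std::string s;
    for (std::uint64_t k = 1; k <= n; ++k)  s += (bit_paper_fold(k) ? '1' : '0');
    return s;
}

// Check that both forms agree for k = 1, 2, ..., n.
bool paper_fold_forms_agree(std::uint64_t n)
{
    for (std::uint64_t k = 1; k <= n; ++k)
        if ( bit_paper_fold(k) != paper_fold_by_odd_part(k) )  return false;
    return true;
}
```

The string for n = 16 is 1101100111001001, the start of the sequence given above (the sequence begins with the element for k = 1).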
Figure 1.31-F: The first 1024 segments of the dragon curve (two different renderings).


Chapter  1:  Bit  wizardry 


1.31.3.1  The  dragon  curve 


Another name for the sequence is dragon curve sequence, because a space filling curve known as dragon
curve (or Heighway dragon) can be generated if we interpret a one as 'turn left' and a zero as 'turn right'.

The top of figure 1.31-F shows the first 1024 segments of the curve (created with [FXT: bits/dragon-curve-texpic-demo.cc]).
As some points are visited twice we draw the turns with cut off corners, for the
(left) turn A --> B --> C:

       C                      C
       |                      |
       |       drawn as       |
       |                     /
  A -- B                A --/  B

The code is given in [FXT: aux0/tex-line.cc]. The first few moves of the curve can be found by repeatedly
folding a strip of paper. Always pick up the right side and fold to the left. Unfold the paper and adjust
all corners to be 90 degrees. This gives the first few segments of the dragon curve.

When all angles are replaced by diagonals between the midpoints of the lines

       C                      C
       |
       |       drawn as         /
       |                       /
  A -- B                A     /  B

then the curve appears as shown at the bottom of figure 1.31-F.


Start: 0
Rules:
  0 --> 01
  1 --> 21
  2 --> 23
  3 --> 03

0:  0
1:  01
2:  0121
3:  01212321
4:  0121232123032321
5:  01212321230323212303010323032321
6:  0121232123032321230301032303232123030103012101032303010323032321
    +^-^-v-^-v+v-v-^-v+v+^+v-v+v-v-^-v+v+^+v+^-^+^+v-v+v+^+v-v+v-v-^

Figure 1.31-G: Moves of the dragon curve generated by a string substitution process.


The net rotation of the dragon curve after k steps, as multiple of the right angle, can be computed by
counting the ones in the Gray code of k. Take the result modulo 4 to ignore multiples of 360 degrees
[FXT: bits/bit-paper-fold.h]:

static inline ulong bit_dragon_rot(ulong k)  { return bit_count( k ^ (k>>1) ) & 3; }

The sequence of rotations is entry A005811 in [312]:

seq   = 0121232123432321234345432343232123 ...
mod 4 = 0121232123032321230301032303232123 ...
move  = +^-^-v-^-v+v-v-^-v+v+^+v-v+v-v-^-v ...
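The rows of the table can be reproduced with a few lines. This sketch replaces bit_count() by a simple loop and assumes 64-bit words; the helpers dragon_rot_string() and dragon_moves() are hypothetical, and the move symbols follow the convention 0 = '+', 1 = '^', 2 = '-', 3 = 'v' read off from the table.

```cpp
#include <cstdint>
#include <string>

// Number of set bits (simple loop, stands in for bit_count()).
static inline unsigned bit_count(std::uint64_t x)
{
    unsigned c = 0;
    while (x)  { x &= (x - 1);  ++c; }  // clear the lowest set bit
    return c;
}

// Net rotation after k steps, modulo 4: ones in the Gray code of k.
static inline unsigned dragon_rot(std::uint64_t k)  { return bit_count(k ^ (k >> 1)) & 3; }

// Rotations modulo 4 for k = 0, 1, ..., n-1 as a string of digits.
std::string dragon_rot_string(std::uint64_t n)
{
    std::string s;
    for (std::uint64_t k = 0; k < n; ++k)  s += char('0' + dragon_rot(k));
    return s;
}

// Moves for k = 0, 1, ..., n-1 as symbols.
std::string dragon_moves(std::uint64_t n)
{
    std::string s;
    for (std::uint64_t k = 0; k < n; ++k)  s += ("+^-v")[dragon_rot(k)];
    return s;
}
```

Note that the rotation has to be returned as an integer type: converting the value to bool would only distinguish zero from nonzero.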


The sequence of moves (as symbols, last row) can be computed with [FXT: bits/dragon-curve-moves-demo.cc].
A function related to the paper-folding sequence is described in section 38.8.3 on page 744.


1.31.3.2  The  alternate  paper-folding  sequence 

If the strip of paper is folded alternately from the left and right, then another paper-folding sequence is
obtained. It is entry A106665 in [312] and it starts as [FXT: bits/bit-paper-fold-alt-demo.cc]:

10011100100011011001110110001100100111001000110010011101100011011  . . . 

Start: 0
Rules:
  0 --> 01
  1 --> 03
  2 --> 23
  3 --> 21

0:  0
1:  01
2:  0103
3:  01030121
4:  0103012101032303
5:  01030121010323030103012123210121
6:  01030121010323030103012123210121
    +^+v+^-^+^+v-v+v+^+v+^-^-v-^+^-^

Start: L
Rules:
  L --> L+R+L-R
  R --> L+R-L-R
  + --> +
  - --> -

0:  (#=1)
  L
1:  (#=7)
  L+R+L-R
2:  (#=31)
  L+R+L-R+L+R-L-R+L+R+L-R-L+R-L-R
3:  (#=127)
  L+R+L-R+L+R-L-R+L+R+L-R-L+R-L-R+L+R+L-R+L+R-L-R-L+R+L-R-L+R-L-R+L+R+L-R+L+R-L-R+L+ ...

Start: L
Rules:
  L --> R+L+R-L
  R --> R+L-R-L
  + --> +
  - --> -

0:  (#=1)
  L
1:  (#=7)
  R+L+R-L
2:  (#=31)
  R+L-R-L+R+L+R-L+R+L-R-L-R+L+R-L
3:  (#=127)
  R+L-R-L+R+L+R-L-R+L-R-L-R+L+R-L+R+L-R-L+R+L+R-L+R+L-R-L-R+L+R-L+R+L-R-L+R+L+R-L-R+ ...

Figure 1.31-J: Moves and turns of the dragon curve (top) and alternate dragon curve (bottom).

Compute the sequence via [FXT: bits/bit-paper-fold.h]:

static inline bool bit_paper_fold_alt(ulong k)
{
    ulong h = k & -k;  // == lowest_one(k)
    h <<= 1;
    ulong t = h & (k^0xaaaaaaaaUL);  // 32-bit version
    return ( t!=0 );
}

About 413 million values per second are generated. By interpreting the sequence of zeros and ones as
turns we again obtain triangular space-filling curves, shown in figure 1.31-H. The orientations can be
computed as

static inline ulong bit_paper_fold_alt_rot(ulong k)
// Return total rotation (as multiple of the right angle)
// after k steps in the alternate paper-folding curve.
//     k=    0, 1, 2, 3, 4, 5, ...
//   seq(k)= 0, 1, 0, 3, 0, 1, 2, 1, 0, 1, 0, 3, 2, 3, 0, ...
//   move =  +  ^  +  v  +  ^  -  ^  +  ^  +  v  -  v  +  ...
//   (+==right, -==left, ^==up, v==down).
// Algorithm: count the ones in (w ^ gray_code(k)).
{
    const ulong w = 0xaaaaaaaaUL;  // 32-bit version
    return bit_count( w ^ (k^(k>>1)) ) & 3;  // modulo 4
}


Figure 1.31-J shows a different string substitution process for the generation of the rotations (symbols
'+' and '-') for the paper-folding sequences, both symbols 'L' and 'R' are interpreted as a unit move in
the current direction.


If  the  constant  in  the  routine  is  replaced  by  a parameter  w,  then  its  bits  determine  whether  a left  or  a 
right  fold  was  made  at  each  step: 

static inline bool bit_paper_fold_general(ulong k, ulong w)
{
    ulong h = k & -k;  // == lowest_one(k)
    h <<= 1;
    ulong t = h & (k^w);
    return ( t!=0 );
}
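The effect of the parameter w can be seen in a small experiment. The sketch assumes 64-bit words and uses the 64-bit analogue 0xaaaaaaaaaaaaaaaa of the constant; the helper fold_string() is hypothetical. For small k the word with all bits set reproduces the regular paper-folding sequence, and w = 0 gives its complement.

```cpp
#include <cstdint>
#include <string>

static inline bool bit_paper_fold_general(std::uint64_t k, std::uint64_t w)
{
    std::uint64_t h = k & (0 - k);  // lowest set bit of k
    h <<= 1;                        // position left of it
    return ((h & (k ^ w)) != 0);    // test that bit of k, flipped where w has a one
}

// Elements k = 1, 2, ..., n for the fold pattern w, as a string of zeros and ones.
std::string fold_string(std::uint64_t n, std::uint64_t w)
{
    std::string s;
    for (std::uint64_t k = 1; k <= n; ++k)  s += (bit_paper_fold_general(k, w) ? '1' : '0');
    return s;
}
```

For example, fold_string(16, 0xaaaaaaaaaaaaaaaaULL) gives 1001110010001101, the start of the alternate paper-folding sequence.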


1.31.4  Terdragon  and  hexdragon 

The terdragon curve turns to the left or right by 120 degrees depending on the sequence

0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, ...

Start: 0
Rules:
  0 --> 010
  1 --> 011

0:  (#=1)
  0
1:  (#=3)
  010
2:  (#=9)
  010011010
3:  (#=27)
  010011010010011011010011010
4:  (#=81)
  010011010010011011010011010010011010010011011010011011010011010010011011010011010

Start: F
Rules:
  F --> F+F-F
  + --> +
  - --> -

0:  (#=1)
  F
1:  (#=5)
  F+F-F
2:  (#=17)
  F+F-F+F+F-F-F+F-F
3:  (#=53)
  F+F-F+F+F-F-F+F-F+F+F-F+F+F-F-F+F-F-F+F-F+F+F-F-F+F-F
4:  (#=161)
  F+F-F+F+F-F-F+F-F+F+F-F+F+F-F-F+F-F-F+F-F+F+F-F-F+F-F+F+F-F+F+F-F-F+F-F+F+F-F+F+F- ...

Figure 1.31-M: Turns of the terdragon curve, generated by string substitution (top), alternative process
for the moves and turns (bottom, identify '+' with '0' and '-' with '1').

Start: F
Rules:
  F --> F+L+F-L-F
  + --> +
  - --> -
  L --> L

0:  (#=1)
  F
1:  (#=9)
  F+L+F-L-F
2:  (#=33)
  F+L+F-L-F+L+F+L+F-L-F-L-F+L+F-L-F
3:  (#=105)
  F+L+F-L-F+L+F+L+F-L-F-L-F+L+F-L-F+L+F+L+F-L-F+L+F+L+F-L-F-L-F+L+F-L-F-L-F+L+F-L-F+ ...

Figure 1.31-N: String substitution process for the hexdragon.


The sequence is entry A080846 in [312]. It can be generated via the string substitution with rules 0 |--> 010
and 1 |--> 011, see figure 1.31-M. A fast method to compute the sequence is based on radix-3 counting:
let C_1(k) be the number of ones in the radix-3 expansion of k, the sequence is one if C_1(k+1) < C_1(k)
[FXT: bits/bit-dragon3.h]:

static inline bool bit_dragon3_turn(ulong &x)
// Increment the radix-3 word x and
// return whether the number of ones in x is decreased.
{
    ulong s = 0;
    while ( (x & 3) == 2 )  { x >>= 2;  ++s; }  // scan over nines
//    if ( (x & 3) == 0 )  ==> incremented word will have one more 1
//    if ( (x & 3) == 1 )  ==> incremented word will have one less 1
    bool tr = ( (x & 3) != 0 );  // incremented word will have one less 1
    ++x;  // increment next digit
    x <<= (s<<1);  // shift back
    return tr;
}

About 220 million values per second are generated. Two renderings of the first 729 segments of the curve
are shown in figure 1.31-K (created with [FXT: bits/dragon3-texpic-demo.cc]).
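The radix-3 routine can be checked against both descriptions of the sequence. In the following sketch (64-bit words assumed) each radix-3 digit occupies two bits of x; the helpers ones_radix3(), terdragon_turns(), and terdragon_matches_digit_count() are hypothetical names for the comparison.

```cpp
#include <cstdint>
#include <string>

// Increment the radix-3 word x (two bits per digit) and
// return whether the number of ones among its digits decreased.
static inline bool bit_dragon3_turn(std::uint64_t &x)
{
    std::uint64_t s = 0;
    while ((x & 3) == 2)  { x >>= 2;  ++s; }  // skip trailing digits 2
    bool tr = ((x & 3) != 0);                 // digit 1 becomes 2: one less 1
    ++x;                                      // increment the digit
    x <<= (s << 1);                           // shift back, skipped digits are now 0
    return tr;
}

// Number of digits 1 in the radix-3 expansion of k.
static inline unsigned ones_radix3(std::uint64_t k)
{
    unsigned c = 0;
    while (k)  { c += (k % 3 == 1);  k /= 3; }
    return c;
}

// The first n turns as a string of zeros and ones.
std::string terdragon_turns(unsigned n)
{
    std::string s;
    std::uint64_t x = 0;
    for (unsigned k = 0; k < n; ++k)  s += (bit_dragon3_turn(x) ? '1' : '0');
    return s;
}

// Check the turns against the condition C_1(k+1) < C_1(k) for k < n.
bool terdragon_matches_digit_count(unsigned n)
{
    std::uint64_t x = 0;
    for (unsigned k = 0; k < n; ++k)
    {
        bool tr = bit_dragon3_turn(x);
        if ( tr != (ones_radix3(k + 1) < ones_radix3(k)) )  return false;
    }
    return true;
}
```

The first 27 turns are 010011010010011011010011010, which is the third level of the substitution 0 |--> 010, 1 |--> 011.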


If we replace each turn by 120 degrees (followed by a line) by two turns by 60 degrees (each followed by a
line) we obtain what may be called a hexdragon, shown in figure 1.31-L (created with [FXT: bits/dragon-hex-texpic-demo.cc]).
A string substitution process for the hexdragon is shown in figure 1.31-N.


1.31.5  Dragon curves based on radix-R counting

Another dragon curve can be generated based on radix-5 counting (we will call the curve R5-dragon) [FXT:
bits/bit-dragon-r5.h]:

static inline bool bit_dragon_r5_turn(ulong &x)
// Increment the radix-5 word x and
// return (tr) whether the lowest nonzero digit
// of the incremented word is > 2.
{
    ulong s = 0;
    while ( (x & 7) == 4 )  { x >>= 3;  ++s; }  // scan over nines
    bool tr = ( (x & 7) >= 2 );  // whether digit will be > 2
    ++x;  // increment next digit
    x <<= (3*s);  // shift back
    return tr;
}


About 310 million values per second are generated. The turns are by 90 degrees. Two renderings of the
R5-dragon are shown in figure 1.31-O (created with [FXT: bits/dragon-r5-texpic-demo.cc]). The sequence
of returned values (entry A175337 in [312]) can be computed via the string substitution shown in figure
1.31-R (top).
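The agreement between the radix-5 routine and the substitution can be verified for the first few values. In this sketch (64-bit words assumed) each radix-5 digit occupies three bits of x; the helper r5_turns() is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Increment the radix-5 word x (three bits per digit) and return whether
// the lowest nonzero digit of the incremented word is greater than 2.
static inline bool bit_dragon_r5_turn(std::uint64_t &x)
{
    std::uint64_t s = 0;
    while ((x & 7) == 4)  { x >>= 3;  ++s; }  // skip trailing digits 4
    bool tr = ((x & 7) >= 2);                 // digit will be 3 or 4 after the increment
    ++x;                                      // increment the digit
    x <<= (3 * s);                            // shift back, skipped digits are now 0
    return tr;
}

// The first n turns as a string of zeros and ones.
std::string r5_turns(unsigned n)
{
    std::string s;
    std::uint64_t x = 0;
    for (unsigned k = 0; k < n; ++k)  s += (bit_dragon_r5_turn(x) ? '1' : '0');
    return s;
}
```

Each block of five values is 0011 followed by the value determined by the next higher digit, which is exactly the substitution 0 |--> 00110, 1 |--> 00111.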


Based on radix-7 counting we can generate a curve that will be called the R7-dragon, the turns are by
120 degrees [FXT: bits/bit-dragon-r7.h]:

static inline bool bit_dragon_r7_turn(ulong &x)
// Increment the radix-7 word x and
// return (tr) whether the lowest nonzero digit
// of the incremented word is either 2, 3, or 6.
{
    ulong s = 0;
    while ( (x & 7) == 6 )  { x >>= 3;  ++s; }  // scan over nines
    ++x;  // increment next digit
    bool tr = ( x & 2 );  // whether digit is either 2, 3, or 6
    x <<= (3*s);  // shift back
    return tr;
}

Start: 0
Rules:
  0 --> 00110
  1 --> 00111

0:  (#=1)
  0
1:  (#=5)
  00110
2:  (#=25)
  0011000110001110011100110
3:  (#=125)
  00110001100011100111001100011000110001110011100110001100011000111001110011100 \
  110001100011100111001110011000110001110011100110

Start: 0
Rules:
  0 --> 0100110
  1 --> 0110110

0:  (#=1)
  0
1:  (#=7)
  0100110
2:  (#=49)
  0100110011011001001100100110011011001101100100110
3:  (#=343)
  010011001101100100110010011001101100110110010011001001100110110011011001001 ...

Start: 0
Rules:
  0 --> 0++--00
  + --> 0++--0+
  - --> 0++--0-

0:  (#=1)
  0
1:  (#=7)
  0++--00
2:  (#=49)
  0++--000++--0+0++--0+0++--0-0++--0-0++--000++--00
3:  (#=343)
  0++--000++--0+0++--0+0++--0-0++--0-0++--000++--000++--000++--0+0++--0+0++-- ...

Figure 1.31-R: Turns of the R5-dragon (top), the R7-dragon (middle), and the second R7-dragon
(bottom), generated by string substitution.

static inline int bit_dragon_r7_2_turn(ulong &x)
// Increment the radix-7 word x and
// return (tr) according to the lowest nonzero digit d
// of the incremented word:
//   d==[1,2,3,4,5,6] ==> tr:=[0,+1,+1,-1,-1,0]
// (tr * 120deg) is the turn with the second R7-dragon.
{
    ulong s = 0;
    while ( (x & 7) == 6 )  { x >>= 3;  ++s; }  // scan over nines
    ++x;  // increment next digit
    int tr = 2 - ( (0x2f58 >> (2*(x&7)) ) & 3 );
    x <<= (3*s);  // shift back
    return tr;
}


The sequence of turns can be generated


Start: F
Rules:  F --> F+F+F-F-F   + --> +   - --> -

0:  (#=1)
  F
1:  (#=9)
  F+F+F-F-F
2:  (#=49)
  F+F+F-F-F+F+F+F-F-F+F+F+F-F-F-F+F+F-F-F-F+F+F-F-F
3:  (#=249)
  F+F+F-F-F+F+F+F-F-F+F+F+F-F-F-F+F+F-F-F-F+F+F-F-F+F+F+F-F-F+F+F+F-F-F+F+F+F-F-F-F+ ...

Start: F
Rules:  F --> F+F-F-F+F+F-F   + --> +   - --> -

0:  (#=1)
  F
1:  (#=13)
  F+F-F-F+F+F-F
2:  (#=97)
  F+F-F-F+F+F-F+F+F-F-F+F+F-F-F+F-F-F+F+F-F-F+F-F-F+F+F-F+F+F-F-F+F+F-F+F+F-F-F+F+F- ...

Start: F
Rules:  F --> F0F+F+F-F-F0F   + --> +   - --> -   0 --> 0

0:  (#=1)
  F
1:  (#=13)
  F0F+F+F-F-F0F
2:  (#=97)
  F0F+F+F-F-F0F0F0F+F+F-F-F0F+F0F+F+F-F-F0F+F0F+F+F-F-F0F-F0F+F+F-F-F0F-F0F+F+F-F-F0 ...

Figure 1.31-S: String substitution processes for the turns (symbols '+' and '-') and moves (symbol 'F'
is a unit move in the current direction) of the R5-dragon (top), the R7-dragon (middle), and the second
R7-dragon (bottom).


entry A176416 in [312].

Two curves respectively based on radix-9 and radix-13 counting are shown in figure 1.31-T. The corresponding
routines are given in [FXT: bits/bit-dragon-r9.h]

static inline bool bit_dragon_r9_turn(ulong &x)
// Increment the radix-9 word x and
// return (tr) whether the lowest nonzero digit
// of the incremented word is either 2, 3, 5, or 8.
// tr determines whether to turn left or right (by 120 degrees)
// with the R9-dragon fractal.
// The sequence tr is the fixed point
// of the morphism 0 |--> 011010010, 1 |--> 011010011.
// Also fixed point of morphism (identify + with 0 and - with 1)
//   F |--> F+F-F-F+F-F+F+F-F, + |--> +, - |--> -
// Also fixed point of morphism
//   F |--> G+G-G, G |--> F-F+F, + |--> +, - |--> -
{
    ulong s = 0;
    while ( (x & 15) == 8 )  { x >>= 4;  ++s; }  // scan over nines
    ++x;  // increment next digit
    bool tr = ( (0x12c >> (x&15)) & 1 );  // whether digit is either 2, 3, 5, or 8
    x <<= (4*s);  // shift back
    return tr;
}


and [FXT: bits/bit-dragon-r13.h]:

static inline bool bit_dragon_r13_turn(ulong &x)
// Increment the radix-13 word x and
// return (tr) whether the lowest nonzero digit
// of the incremented word is either 3, 6, 8, 9, 11, or 12.
// tr determines whether to turn left or right (by 90 degrees)
// with the R13-dragon fractal.
// The sequence tr is the fixed point
// of the morphism 0 |--> 0010010110110, 1 |--> 0010010110111.
// Also fixed point of morphism (identify + with 0 and - with 1)
//   F |--> F+F+F-F+F+F-F+F-F-F+F-F-F, + |--> +, - |--> -
{
    ulong s = 0;

Chapter 2

Permutations  and  their  operations 


We  study  permutations  together  with  the  operations  on  them,  like  composition  and  inversion.  We 
further  discuss  the  decomposition  of  permutations  into  cycles  and  give  methods  for  generating  random 
permutations,  cyclic  permutations,  involutions,  and  derangements.  In-place  algorithms  for  applying 
several  special  permutations  like  the  revbin  permutation,  the  Gray  permutation,  and  matrix  transposition 
are  given. 

Algorithms for the generation of all permutations of a given number of objects and bijections between
permutations and mixed radix numbers in factorial base are given in chapter 10.

2.1  Basic  definitions  and  operations 

A permutation of n elements can be represented by an array X = [x_0, x_1, ..., x_{n-1}]. When the permutation
X is applied to F = [f_0, f_1, ..., f_{n-1}], then the element at position k is moved to position x_k. A
routine for the operation is [FXT: perm/permapply.h]:

template <typename Type>
void apply_permutation(const ulong *x, const Type *f, Type * restrict g, ulong n)
// Apply the permutation x[] to the array f[],
// i.e. set g[x[k]] <-- f[k] for all k
{
    for (ulong k=0; k<n; ++k)  g[x[k]] = f[k];
}
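A small worked example may help to fix the convention. The sketch below is a variant with std::vector (not the FXT interface): it returns the result instead of writing to a third array, and it assumes that x is a valid permutation of the same length as f.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Apply the permutation x to f: the element at position k moves to position x[k].
template <typename Type>
std::vector<Type> apply_permutation(const std::vector<std::size_t> &x, const std::vector<Type> &f)
{
    assert(x.size() == f.size());
    std::vector<Type> g(f.size());
    for (std::size_t k = 0; k < f.size(); ++k)  g[x[k]] = f[k];
    return g;
}
```

Applying X = [0, 2, 4, 6, 1, 3, 5, 7] to F = [0, 1, ..., 7] gives [0, 4, 1, 5, 2, 6, 3, 7]: as every element k lands at position x_k, the result is the array of the inverse permutation.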

Routines to test various properties of permutations are given in [FXT: perm/permq.cc]. The length-n
sequence [0, 1, 2, ..., n-1] represents the identical permutation which leaves all elements in their
position. To check whether a given permutation is the identity is trivial:

bool is_identity(const ulong *f, ulong n)
// Return whether f[] is the identical permutation,
// i.e. whether f[k]==k for all k= 0...n-1
{
    for (ulong k=0; k<n; ++k)  if ( f[k] != k )  return false;
    return true;
}

A fixed  point  of  a permutation  is  an  index  where  the  element  is  not  moved: 

ulong count_fixed_points(const ulong *f, ulong n)
// Return number of fixed points in f[]
{
    ulong ct = 0;
    for (ulong k=0; k<n; ++k)  ct += ( f[k] == k );
    return ct;
}

A derangement  is  a permutation  that  has  no  fixed  points.  A routine  to  check  whether  a permutation  is 
a derangement  is 

bool is_derangement(const ulong *f, ulong n)
// Return whether f[] is a derangement of identity,
// i.e. whether f[k]!=k for all k
{
    for (ulong k=0; k<n; ++k)  if ( f[k] == k )  return false;
    return true;
}

Whether two arrays are mutual derangements (that is, f_k ≠ g_k for all k) can be determined by:

bool is_derangement(const ulong *f, const ulong *g, ulong n)
// Return whether f[] is a derangement of g[],
// i.e. whether f[k]!=g[k] for all k
{
    for (ulong k=0; k<n; ++k)  if ( f[k] == g[k] )  return false;
    return true;
}

A connected (or indecomposable) permutation contains no proper prefix mapped to itself. We test whether
max(f_0, f_1, ..., f_k) > k for all k < n-1:

bool
is_connected(const ulong *f, ulong n)
{
    if ( n<=1 )  return true;
    ulong m = 0;  // maximum
    for (ulong k=0; k<n-1; ++k)  // for all proper prefixes
    {
        const ulong fk = f[k];
        if ( fk>m )  m = fk;
        if ( m<=k )  return false;
    }
    return true;
}

To check whether an array is a valid permutation, we need to verify that each index in the valid range
appears exactly once. The bit-array described in section 4.6 on page 164 allows doing the job without
modifying the input:

bool
is_valid_permutation(const ulong *f, ulong n, bitarray *bp/*=0*/)
// Return whether all values 0...n-1 appear exactly once,
// i.e. whether f represents a permutation of [0,1,...,n-1].
{
    // check whether any element is out of range:
    for (ulong k=0; k<n; ++k)  if ( f[k]>=n )  return false;

    // check whether values are unique:
    bitarray *tp = bp;
    if ( 0==bp )  tp = new bitarray(n);  // tags
    tp->clear_all();

    ulong k;
    for (k=0; k<n; ++k)  if ( tp->test_set(f[k]) )  break;

    if ( 0==bp )  delete tp;

    return  (k==n);
}

The complement of a permutation is computed by replacing every element v by n-1-v [FXT:
perm/permcomplement.h]:

inline void make_complement(const ulong *f, ulong *g, ulong n)
// Set (as permutation) g to the complement of f.
// Can have f==g.
{
    for (ulong k=0; k<n; ++k)  g[k] = n - 1 - f[k];
}

The reversal of a permutation is simply the reversed array [FXT: perm/reverse.h]:

template <typename Type>
inline void reverse(Type *f, ulong n)
// Reverse order of array f.
{
    for (ulong k=0, i=n-1; k<i; ++k, --i)  swap2(f[k], f[i]);
}

2.2  Representation as disjoint cycles


Every  permutation  consists  entirely  of  disjoint  cycles.  A cycle  of  a permutation  is  a subset  of  the  indices 
that  is  rotated  (by  one  position)  by  the  permutation.  The  term  disjoint  means  that  the  cycles  do  not 
‘cross’  each  other.  While  this  observation  may  appear  trivial  it  gives  a recipe  for  many  operations:  follow 
the  cycles  of  the  permutation,  one  by  one,  and  do  the  necessary  operation  on  each  of  them. 

Consider  the  following  permutation  of  length  8: 

[ 0,  2,  4,  6,  1,  3,  5,  7 ] 

There  are  two  fixed  points  (0  and  7,  which  we  omit)  and  these  cycles: 

( 1 --> 2 --> 4 )
( 3 --> 6 --> 5 )

The cycles do 'wrap around', for example, the final 4 of the first cycle goes to position 1, the first element
of the cycle. The inverse permutation is found by reversing every arrow in each cycle:

( 1 <-- 2 <-- 4 )
( 3 <-- 6 <-- 5 )

Equivalently, we can reverse the order of the elements in each cycle:

( 4 --> 2 --> 1 )
( 5 --> 6 --> 3 )

If we begin each cycle with its smallest element, the inverse permutation is written as

( 1 --> 4 --> 2 )
( 3 --> 5 --> 6 )

This  form  is  obtained  by  reversing  all  elements  except  the  first  in  each  cycle  of  the  (forward)  permutation. 
The  last  three  sets  of  cycles  all  describe  the  same  permutation,  it  is 

[ 0,  4,  1,  5,  2,  6,  3,  7 ] 


Permutation:
  [ 0 2 4 6 1 3 5 7 ]
Inverse:
  [ 0 4 1 5 2 6 3 7 ]
Cycles:
  (0)        #=1
  (1, 2, 4)  #=3
  (3, 6, 5)  #=3
  (7)        #=1
Code:
  template <typename Type>
  inline void foo_perm_8(Type *f)
  {
    { Type t=f[1];  f[1]=f[4];  f[4]=f[2];  f[2]=t; }
    { Type t=f[3];  f[3]=f[5];  f[5]=f[6];  f[6]=t; }
  }

Figure 2.2-A: A permutation of 8 elements, its inverse, its cycles, and code for the permutation.


The cycle form of a permutation can be printed with [FXT: perm/printcycles.cc]:

void
print_cycles(const ulong *f, ulong n, bitarray *tb/*=0*/)
// Print cycle form of the permutation in f[].
// Examples (first permutations of 4 elements in lex order):
//       array form     cycle form
//   0:  [ 0 1 2 3 ]    (0) (1) (2) (3)
//   1:  [ 0 1 3 2 ]    (0) (1) (2, 3)
//   2:  [ 0 2 1 3 ]    (0) (1, 2) (3)
//   3:  [ 0 2 3 1 ]    (0) (1, 2, 3)
//   4:  [ 0 3 1 2 ]    (0) (1, 3, 2)
//   5:  [ 0 3 2 1 ]    (0) (1, 3) (2)
//   6:  [ 1 0 2 3 ]    (0, 1) (2) (3)
//   7:  [ 1 0 3 2 ]    (0, 1) (2, 3)
//   8:  [ 1 2 0 3 ]    (0, 1, 2) (3)
{
    bitarray *b = tb;
    if ( tb==0 )  b = new bitarray(n);
    b->clear_all();

    for (ulong k=0; k<n; ++k)
    {
        if ( b->test(k) )  continue;  // already processed

        cout << "(";
        ulong i = k;  // next in cycle
        const char *cm = "";
        do
        {
            cout << cm << i;
            cm = ", ";
            b->set(i);
        }
        while ( (i=f[i]) != k );  // until we are back at the start of the cycle
        cout << ") ";
    }

    if ( tb==0 )  delete b;
}

The bit-array (see section 4.6 on page 164 for the implementation) is used to keep track of the elements
already processed. The routine can be modified to generate code for applying a given permutation to
an array. The program [FXT: perm/cycles-demo.cc] prints cycles and code for a permutation, see figure
2.2-A.


2.2.1  Cyclic  permutations 

A permutation consisting of exactly one cycle is called cyclic. Whether a given permutation has this
property can be tested with [FXT: perm/permq.cc]:

    bool
    is_cyclic(const ulong *f, ulong n)
    // Return whether permutation is exactly one cycle.
    {
        if ( n<=1 )  return true;
        ulong k = 0, e = 0;
        do { e=f[e];  ++k; } while ( e!=0 );
        return (k==n);
    }

The  method  used  is  to  follow  the  cycle  starting  at  position  zero  and  counting  how  long  it  is.  If  the  length 
found  equals  the  array  length,  then  the  permutation  is  cyclic.  There  are  (n  — 1)!  cyclic  permutations  of 
n elements. 
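
The count (n-1)! can be checked by brute force for small n. The following is only a sketch: the helper name count_cyclic() is hypothetical (it is not part of FXT), and the enumeration relies on std::next_permutation, so it is practical only for small n.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

typedef unsigned long ulong;

// Same test as is_cyclic() in the text: follow the cycle through position 0.
bool is_cyclic(const ulong *f, ulong n)
{
    if ( n<=1 )  return true;
    ulong k = 0, e = 0;
    do { e = f[e];  ++k; } while ( e!=0 );
    return (k==n);
}

// Hypothetical helper: count the cyclic permutations of n elements
// by enumerating all n! permutations (small n only).
ulong count_cyclic(ulong n)
{
    assert( n<=10 );  // n! permutations are visited
    std::vector<ulong> f(n);
    std::iota(f.begin(), f.end(), 0UL);  // identity permutation
    ulong ct = 0;
    do { ct += is_cyclic(f.data(), n); }
    while ( std::next_permutation(f.begin(), f.end()) );
    return ct;
}
```

For n = 4, 5, 6 the function should return 3! = 6, 4! = 24 and 5! = 120, respectively.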


2.2.2  Sign  and  parity  of  a permutation 

Every permutation can be written as a composition of transpositions (cycles of length 2). The number
of transpositions is not unique, but it is unique modulo 2. The sign of a permutation is defined to be
+1 if the number is even and -1 if the number is odd. The minimal number of transpositions whose
composition gives a cycle of length $l$ is $l - 1$. So the minimal number of transpositions for a permutation
consisting of $k$ cycles, where the length of the $j$-th cycle is $l_j$, equals
$\sum_{j=1}^{k} (l_j - 1) = \left(\sum_{j=1}^{k} l_j\right) - k$. The
transposition count modulo 2 is called the parity of a permutation.
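
Since the cycle lengths of a permutation of n elements sum to n, the minimal transposition count equals n minus the number of cycles (fixed points counted as cycles of length 1). A minimal sketch of a parity computation based on this observation follows; the function names are hypothetical and a std::vector<bool> stands in for the bit-array of the text.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Return the parity (0 for even, 1 for odd) of the permutation f[] of
// {0, ..., n-1}, computed as (n - number of cycles) mod 2.
ulong parity_by_cycles(const ulong *f, ulong n)
{
    std::vector<bool> done(n, false);  // tags for processed elements
    ulong nc = 0;  // number of cycles, fixed points included
    for (ulong k=0; k<n; ++k)
    {
        if ( done[k] )  continue;  // element belongs to an earlier cycle
        ++nc;
        ulong i = k;
        do { assert( i<n );  done[i] = true;  i = f[i]; } while ( i!=k );
    }
    return (n - nc) & 1;
}

// Sign of the permutation: +1 if the parity is even, -1 if it is odd.
int sign_by_cycles(const ulong *f, ulong n)
{
    return parity_by_cycles(f, n) ? -1 : +1;
}
```

A single transposition has sign -1, a 3-cycle (two transpositions) has sign +1.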


2.3  Compositions  of  permutations 

We can apply several permutations to an array, one by one. The resulting permutation is called the
composition of the applied permutations. The operation of composition is not commutative: in general
$f \cdot g \neq g \cdot f$ for $f \neq g$. We note that the permutations of n elements form a group (of n! elements); the
group operation is composition.


2.3.1 The inverse of a permutation

A permutation f is the inverse of the permutation g if it undoes its effect: $f \cdot g = \mathrm{id}$. A test whether
two permutations f and g are mutual inverses is

    bool is_inverse(const ulong *f, const ulong *g, ulong n)
    // Return whether f[] is the inverse of g[]
    {
        for (ulong k=0; k<n; ++k)  if ( f[g[k]] != k )  return false;
        return true;
    }

We have $g \cdot f = f \cdot g = \mathrm{id}$: in a group the left-inverse is equal to the right-inverse, so we can simply call
g ‘the inverse’ of f.

A permutation  which  is  its  own  inverse  is  called  an  involution.  Checking  for  this  is  easy: 

    bool is_involution(const ulong *f, ulong n)
    // Return whether max cycle length is <= 2,
    // i.e. whether f * f = id.
    {
        for (ulong k=0; k<n; ++k)  if ( f[f[k]] != k )  return false;
        return true;
    }

The following routine computes the inverse of a given permutation [FXT: perm/perminvert.cc]:

    void make_inverse(const ulong *f, ulong * restrict g, ulong n)
    // Set (as permutation) g to the inverse of f
    {
        for (ulong k=0; k<n; ++k)  g[f[k]] = k;
    }

For the in-place computation of the inverse we have to reverse each cycle [FXT: perm/perminvert.cc]:

    void make_inverse(ulong *f, ulong n, bitarray *bp/*=0*/)
    // Set (as permutation) f to its own inverse.
    {
        bitarray *tp = bp;
        if ( 0==bp )  tp = new bitarray(n);  // tags
        tp->clear_all();

        for (ulong k=0; k<n; ++k)
        {
            if ( tp->test_clear(k) )  continue;  // already processed
            tp->set(k);

            // invert a cycle:
            ulong i = k;
            ulong g = f[i];  // next index
            while ( 0==(tp->test_set(g)) )
            {
                ulong t = f[g];
                f[g] = i;
                i = g;
                g = t;
            }
            f[g] = i;
        }

        if ( 0==bp )  delete tp;
    }

The extra array of tag-bits can be avoided by using the highest bit of each word as a tag-bit. The scheme
would fail if any word of the permutation array had the highest bit set. However, on byte-addressable
machines such an array will not fit into memory (for word sizes of 16 or more bits). To keep the code
similar to the version using the bit-array, we define

    static const ulong s1 = 1UL << (BITS_PER_LONG - 1);  // highest bit is tag-bit
    static const ulong s0 = ~s1;  // all bits but tag-bit

    static inline void SET(ulong *f, ulong k)  { f[k&s0] |= s1; }
    static inline void CLEAR(ulong *f, ulong k)  { f[k&s0] &= s0; }
    static inline bool TEST(ulong *f, ulong k)  { return (0!=(f[k&s0]&s1)); }


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We  have  to  mask  out  the  tag-bit  when  using  the  index  variable  k.  The  routine  can  be  implemented  as 

    void
    make_inverse(ulong *f, ulong n)
    // Set (as permutation) f to its own inverse.
    // In-place version using highest bits of array as tag-bits.
    {
        for (ulong k=0; k<n; ++k)
        {
            if ( TEST(f, k) )  { CLEAR(f, k);  continue; }  // already processed
            SET(f, k);

            // invert a cycle:
            ulong i = k;
            ulong g = f[i] & s0;  // next index (tag-bit of f[k] masked out)
            while ( 0==TEST(f, g) )
            {
                ulong t = f[g];
                f[g] = i;
                SET(f, g);
                i = g;
                g = t;
            }
            f[g] = i;

            CLEAR(f, k);  // leave no tag-bit
        }
    }

The  extra  CLEAR  ()  statement  at  the  end  removes  the  tag-bit  of  the  cycle  minima.  Its  effect  is  that 
no  tag-bits  are  set  after  the  routine  has  finished.  This  routine  has  about  the  same  performance  as  the 
bit-array  version. 
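
The tag-bit technique can be tried out in isolation with the following self-contained sketch. It is a variant written for illustration, not the FXT routine: the names tag_bit, idx_mask and make_inverse_tagged() are hypothetical, the word size is taken from sizeof() instead of BITS_PER_LONG, and the successor of the cycle leader is read before the leader is tagged so that no index ever carries a tag-bit.

```cpp
#include <cassert>
#include <climits>

typedef unsigned long ulong;

static const ulong tag_bit = 1UL << (sizeof(ulong)*CHAR_BIT - 1);  // highest bit
static const ulong idx_mask = ~tag_bit;  // all bits but the tag-bit

// Set (as permutation) f to its own inverse, using the highest bit
// of each array element as tag.  No tag-bit is left set on return.
void make_inverse_tagged(ulong *f, ulong n)
{
    for (ulong k=0; k<n; ++k)
    {
        if ( f[k] & tag_bit )  { f[k] &= idx_mask;  continue; }  // already processed

        // invert the cycle with leader k:
        ulong i = k;
        ulong g = f[k];  // next index, read before f[k] is tagged
        assert( g<n );
        f[k] |= tag_bit;
        while ( 0==(f[g] & tag_bit) )
        {
            const ulong t = f[g];  // untagged, so a valid index
            f[g] = i | tag_bit;  // store predecessor and tag the element
            i = g;
            g = t;
        }
        f[g] = i;  // here g==k; the store also removes the tag of the leader
    }
}
```

The tags of the non-leader elements are removed when the outer loop reaches them, which happens later because the leader is the smallest index of its cycle.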


2.3.2  The  square  of  a permutation 


The square of a permutation is the composition with itself. The routine for squaring is [FXT:
perm/permcompose.cc]:

    void make_square(const ulong *f, ulong * restrict g, ulong n)
    // Set (as permutation) g = f * f
    {
        for (ulong k=0; k<n; ++k)  g[k] = f[f[k]];
    }

The  in-place  version  is 




    void make_square(ulong *f, ulong n, bitarray *bp/*=0*/)
    // Set (as permutation) f = f * f
    // In-place version.
    {
        bitarray *tp = bp;
        if ( 0==bp )  tp = new bitarray(n);  // tags
        tp->clear_all();

        for (ulong k=0; k<n; ++k)
        {
            if ( tp->test_clear(k) )  continue;  // already processed
            tp->set(k);

            // square a cycle:
            ulong i = k;
            ulong t = f[i];  // save
            ulong g = f[i];  // next index
            while ( 0==(tp->test_set(g)) )
            {
                f[i] = f[g];
                i = g;
                g = f[g];
            }
            f[i] = t;
        }

        if ( 0==bp )  delete tp;
    }

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2.3.3  Composing  and  powering  permutations 

The  composition  of  two  permutations  can  be  computed  as 

    void
    compose(const ulong *f, const ulong *g, ulong * restrict h, ulong n)
    // Set (as permutation) h = f * g
    {
        for (ulong k=0; k<n; ++k)  h[k] = f[g[k]];
    }

The  following  version  will  be  used  in  the  powering  routine  for  permutations: 

    void
    compose(const ulong *f, ulong * restrict g, ulong n)
    // Set (as permutation) g = f * g
    {
        for (ulong k=0; k<n; ++k)  g[k] = f[g[k]];  // yes, this works
    }

The e-th power of a permutation f is computed (and returned in g) by a version of the binary
exponentiation algorithm described in section 28.5 on page 563 [FXT: perm/permcompose.cc]:

    void
    power(const ulong *f, ulong * restrict g, ulong n, long e,
          ulong * restrict t/*=0*/)
    // Set (as permutation) g = f ** e
    {
        if ( e==0 )
        {
            for (ulong k=0; k<n; ++k)  g[k] = k;
            return;
        }

        if ( e==1 )
        {
            acopy(f, g, n);
            return;
        }

        if ( e==-1 )
        {
            make_inverse(f, g, n);
            return;
        }

        // here: abs(e) > 1
        ulong x = e>0 ? e : -e;

        if ( is_pow_of_2(x) )  // special case x==2^n
        {
            make_square(f, g, n);
            while ( x>2 )  { make_square(g, n);  x /= 2; }
        }
        else
        {
            ulong *tt = t;
            if ( 0==t )  { tt = new ulong[n]; }
            acopy(f, tt, n);

            int firstq = 1;
            while ( 1 )
            {
                if ( x&1 )  // odd
                {
                    if ( firstq )  // avoid multiplication by 1
                    {
                        acopy(tt, g, n);
                        firstq = 0;
                    }
                    else  compose(tt, g, n);

                    if ( x==1 )  goto dort;
                }

                make_square(tt, n);
                x /= 2;
            }

        dort:
            if ( 0==t )  delete [] tt;
        }

        if ( e<0 )  make_inverse(g, n);
    }

The routine involves O(n log(n)) operations. By extracting the cycles of the permutation, computing
their e-th powers, and copying them back, we could reduce the complexity to only O(n). The e-th power
of a cycle is a cyclic shift by e positions, as described in section 2.9 on page 123.
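
The cycle-based approach could look as follows. This is a sketch under stated assumptions, not an FXT routine: the name power_by_cycles() is hypothetical, a std::vector<bool> replaces the bit-array, and each cycle is copied into a temporary buffer before it is written to g[].

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Set (as permutation) g = f ** e by shifting every cycle of f by e positions.
// Each element is visited a constant number of times; f and g must not overlap.
void power_by_cycles(const ulong *f, ulong *g, ulong n, long e)
{
    std::vector<bool> done(n, false);  // tags for processed elements
    std::vector<ulong> cyc;  // elements of the current cycle, in cycle order
    for (ulong k=0; k<n; ++k)
    {
        if ( done[k] )  continue;

        cyc.clear();
        ulong i = k;
        while ( !done[i] )  // extract the cycle with leader k
        {
            done[i] = true;
            cyc.push_back(i);
            i = f[i];
            assert( i<n );
        }

        const long len = (long)cyc.size();
        long s = e % len;  // shift, reduced modulo the cycle length
        if ( s<0 )  s += len;  // negative exponents shift backwards
        for (long j=0; j<len; ++j)  g[ cyc[j] ] = cyc[ (j+s) % len ];
    }
}
```

Because the exponent is reduced modulo each cycle length, the cost does not depend on the size of e.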


2.4  In-place  methods  to  apply  permutations  to  data 


We repeat the routine for applying a permutation [FXT: perm/permapply.h]:

    template <typename Type>
    void apply_permutation(const ulong *x, const Type *f, Type * restrict g, ulong n)
    // Apply the permutation x[] to the array f[],
    // i.e. set g[x[k]] <-- f[k] for all k
    {
        for (ulong k=0; k<n; ++k)  g[x[k]] = f[k];
    }

The  in-place  version  follows  the  cycles  of  the  permutation: 

    template <typename Type>
    void apply_permutation(const ulong *x, Type * restrict f, ulong n, bitarray *bp=0)
    {
        bitarray *tp = bp;
        if ( 0==bp )  tp = new bitarray(n);  // tags
        tp->clear_all();

        for (ulong k=0; k<n; ++k)
        {
            if ( tp->test_clear(k) )  continue;  // already processed
            tp->set(k);

            // do cycle:
            ulong i = k;  // start of cycle
            Type t = f[i];
            ulong g = x[i];
            while ( 0==(tp->test_set(g)) )  // cf. gray_permute()
            {
                Type tt = f[g];
                f[g] = t;
                t = tt;
                g = x[g];
            }
            f[g] = t;
            // end (do cycle)
        }

        if ( 0==bp )  delete tp;
    }

To  apply  the  inverse  of  a permutation  without  inverting  the  permutation  itself,  use 

    template <typename Type>
    void apply_inverse_permutation(const ulong *x, const Type *f, Type * restrict g, ulong n)
    {
        for (ulong k=0; k<n; ++k)  g[k] = f[x[k]];
    }

The  in-place  version  is 

    template <typename Type>
    void apply_inverse_permutation(const ulong *x, Type * restrict f, ulong n, bitarray *bp=0)
    {
        bitarray *tp = bp;
        if ( 0==bp )  tp = new bitarray(n);  // tags
        tp->clear_all();

        for (ulong k=0; k<n; ++k)
        {
            if ( tp->test_clear(k) )  continue;  // already processed
            tp->set(k);

            // do cycle:
            ulong i = k;  // start of cycle
            Type t = f[i];
            ulong g = x[i];
            while ( 0==(tp->test_set(g)) )
            {
                f[i] = f[g];
                i = g;
                g = x[i];
            }
            f[i] = t;
            // end (do cycle)
        }

        if ( 0==bp )  delete tp;
    }


A permutation of n elements can be given as a function X(k) (where $0 \le X(k) < n$ for $0 \le k < n$, and
$X(i) \neq X(j)$ for $i \neq j$). The permutation given as function X can be applied to an array f via [FXT:
perm/permapplyfunc.h]:

    template <typename Type>
    void apply_permutation(ulong (*x)(ulong), const Type *f, Type * restrict g, ulong n)
    // Set g[x(k)] <-- f[k] for all k
    {
        for (ulong k=0; k<n; ++k)  g[x(k)] = f[k];
    }


For  example,  the  following  statements  are  equivalent: 

    apply_permutation(gray_code, f, g, n);
    gray_permute(f, g, n);

The  inverse  routine  is 

    template <typename Type>
    void apply_inverse_permutation(ulong (*x)(ulong), const Type *f, Type * restrict g, ulong n)
    {
        for (ulong k=0; k<n; ++k)  g[k] = f[x(k)];
    }
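
As a small self-contained illustration, the two routines can be exercised with the Gray code as the permutation function. In this sketch gray_code() is defined locally as k ^ (k>>1) (an assumption about the function of that name used in the text), the restrict qualifier is dropped, and round_trip_ok() is a hypothetical helper. The Gray code is a permutation of {0, ..., n-1} only if n is a power of 2.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Gray code of k; a permutation of {0, ..., n-1} whenever n is a power of 2.
inline ulong gray_code(ulong k)  { return k ^ (k>>1); }

template <typename Type>
void apply_permutation(ulong (*x)(ulong), const Type *f, Type *g, ulong n)
// Set g[x(k)] <-- f[k] for all k
{
    for (ulong k=0; k<n; ++k)  { assert( x(k)<n );  g[x(k)] = f[k]; }
}

template <typename Type>
void apply_inverse_permutation(ulong (*x)(ulong), const Type *f, Type *g, ulong n)
// Set g[k] <-- f[x(k)] for all k
{
    for (ulong k=0; k<n; ++k)  { assert( x(k)<n );  g[k] = f[x(k)]; }
}

// Apply the permutation and then its inverse; return whether the data is restored.
bool round_trip_ok(ulong n)
{
    std::vector<int> f(n), g(n), h(n);
    for (ulong k=0; k<n; ++k)  f[k] = (int)(10*k);
    apply_permutation(gray_code, f.data(), g.data(), n);
    apply_inverse_permutation(gray_code, g.data(), h.data(), n);
    return f==h;
}
```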


The  in-place  versions  of  these  routines  are  almost  identical  to  the  routines  that  apply  permutations  given 
as  arrays.  Only  a tiny  change  must  be  made  in  the  processing  of  the  cycles.  For  example,  the  fragment 




    void apply_permutation(const ulong *x, Type * restrict f, ulong n, bitarray *bp=0)
    [--snip--]
            ulong i = k;  // start of cycle
            Type t = f[i];
            ulong g = x[i];
            while ( 0==(tp->test_set(g)) )  // cf. gray_permute()
            {
                Type tt = f[g];
                f[g] = t;
                t = tt;
                g = x[g];
            }
            f[g] = t;
    [--snip--]


must be modified by replacing all occurrences of ‘x[i]’ with ‘x(i)’:

    void apply_permutation(ulong (*x)(ulong), Type *f, ulong n, bitarray *bp=0)
    [--snip--]
            ulong i = k;  // start of cycle
            Type t = f[i];
            ulong g = x(i);  // <--=
            while ( 0==(tp->test_set(g)) )  // cf. gray_permute()
            {
                Type tt = f[g];
                f[g] = t;
                t = tt;
                g = x(g);  // <--=
            }
            f[g] = t;
    [--snip--]


2.5  Random  permutations 


The following routine randomly permutes an array with arbitrary elements [FXT: perm/permrand.h]:

    template <typename Type>
    void random_permute(Type *f, ulong n)
    {
        for (ulong k=n; k>1; --k)
        {
            const ulong i = rand_idx(k);
            swap2(f[k-1], f[i]);
        }
    }

An  alternative  version  for  the  loop  is: 

    for (ulong k=1; k<n; ++k)
    {
        const ulong i = rand_idx(k+1);
        swap2(f[k], f[i]);
    }

The method is given in [132]; it is sometimes called Knuth shuffle or Fisher-Yates shuffle, see [213, alg.P,
sect.3.4.2]. We use the auxiliary routine [FXT: aux0/rand-idx.h]:

    inline ulong rand_idx(ulong m)
    // Return random number in the range [0, 1, ..., m-1].
    // Must have m>0.
    {
        if ( m==1 )  return 0;  // could also use % 1
        ulong x = (ulong)rand();
        x ^= x>>16;  // avoid using low bits of rand() alone
        return x % m;
    }

A random permutation is computed by applying the function to the identity permutation:

    void random_permutation(ulong *f, ulong n)
    // Create a random permutation
    {
        for (ulong k=0; k<n; ++k)  f[k] = k;
        random_permute(f, n);
    }

A slight modification of the underlying idea can be used for a routine for random selection from a list
with only one linear read. Let L be a list of n items $L_1, \ldots, L_n$.

1. Set $t = L_1$, set $k = 1$.

2. Set $k = k + 1$. If $k > n$ return $t$.

3. With probability $1/k$ set $t = L_k$.

4. Go to step 2.

Note that one does not need to know n, the number of elements in the list, in advance: replace the second
statement in step 2 by “If there are no more elements, return t”.
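
The four steps translate almost literally into code. The sketch below is one possible rendering, not an FXT routine: the name random_select() is hypothetical, the list is given as an iterator range, and the random decisions are drawn with std::uniform_int_distribution instead of rand_idx().

```cpp
#include <cassert>
#include <random>
#include <vector>

// Select one element of a nonempty sequence uniformly at random with a
// single pass.  The length of the sequence need not be known in advance.
template <typename Iter, typename Rng>
Iter random_select(Iter first, Iter last, Rng &rng)
{
    assert( first!=last );  // the sequence must not be empty
    Iter t = first;  // step 1: t = L_1, k = 1
    unsigned long k = 1;
    for (Iter it=++first;  it!=last;  ++it)
    {
        ++k;  // step 2: next element, if there is one
        std::uniform_int_distribution<unsigned long> d(0, k-1);
        if ( d(rng)==0 )  t = it;  // step 3: replace with probability 1/k
    }
    return t;  // no more elements
}
```

Element j survives with probability (1/j) (j/(j+1)) ... ((n-1)/n) = 1/n, so every element is equally likely to be returned.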
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2.5.1  Random  cyclic  permutation 


A routine to apply a random cyclic permutation (as defined in section 2.2.1 on page 105) to an array is
[FXT: perm/permrand-cyclic.h]:

    template <typename Type>
    void random_permute_cyclic(Type *f, ulong n)
    // Permute the elements of f by a random cyclic permutation.
    {
        for (ulong k=n-1; k>0; --k)
        {
            const ulong i = rand_idx(k);
            swap2(f[k], f[i]);
        }
    }


The  method  is  called  Sattolo’s  algorithm , see  I23HI.  and  also  m and  |362j.  It  can  be  described  as  a 
method  to  arrange  people  in  a cycle:  Assume  there  are  n people  in  a room.  Let  the  first  person  choose 
a successor  out  of  the  remaining  persons  not  yet  chosen.  Then  let  the  person  just  chosen  make  the  next 
choice  of  a successor.  Repeat  until  everyone  has  been  chosen.  Finally,  let  the  first  person  be  the  successor 
of  the  last  person  chosen. 

The cycle representation of a random cyclic permutation can be computed by applying a random
permutation to all elements (of the identity permutation) except for the first element.
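
A possible rendering of this remark is sketched below. The function names are hypothetical, std::shuffle takes the place of random_permute(), and the cycle form is converted to array form so that the result can be checked with the cycle-following test of section 2.2.1.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

typedef unsigned long ulong;

// Return a random cyclic permutation of {0, ..., n-1} in array form.
// The cycle (c[0], c[1], ..., c[n-1]) with c[0]==0 is drawn by shuffling
// all elements except the first, then converted via f[c[j]] = c[j+1].
template <typename Rng>
std::vector<ulong> random_cyclic_permutation(ulong n, Rng &rng)
{
    std::vector<ulong> c(n);
    std::iota(c.begin(), c.end(), 0UL);  // identity permutation
    if ( n>1 )  std::shuffle(c.begin()+1, c.end(), rng);  // keep c[0] fixed

    std::vector<ulong> f(n);
    for (ulong j=0; j<n; ++j)  f[ c[j] ] = c[ (j+1)%n ];  // successor in cycle
    return f;
}

// Length of the cycle through element 0 (equals n for a cyclic permutation).
ulong leader_cycle_length(const std::vector<ulong> &f)
{
    assert( !f.empty() );
    ulong k = 0, e = 0;
    do { e = f[e];  ++k; } while ( e!=0 );
    return k;
}
```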


2.5.2  Random  prefix  of  a permutation 

A length-m prefix of a random permutation of n elements is computed by the following routine that uses
just O(m) operations [FXT: perm/permrand-pref.h]:

    template <typename Type>
    void random_permute_pref(Type *f, ulong n, ulong m)
    // Set the first m elements to a prefix of a random permutation.
    // Same as: set the first m elements of f to a random permutation
    // of a random selection of all n elements.
    // Must have m<=n-1.
    // Same as random_permute() if m>=n-1.
    {
        if ( m>n-1 )  m = n-1;  // m>n is not admissible
        for (ulong k=0,j=n; k<m; ++k,--j)
        {
            const ulong i = k + rand_idx(j);
            swap2(f[k], f[i]);
        }
    }

The first element is randomly selected from all n elements, the second from the remaining n-1 elements,
and so on. Thus there are $n\,(n-1) \cdots (n-m+1) = n!/(n-m)!$ length-m prefixes of permutations of
n elements.


2.5.3  Random  permutation  with  prescribed  parity 




To compute a random permutation with prescribed parity (as defined in section 2.2.2 on page 105) we
keep track of the parity of the generated permutation and change it via a single transposition if necessary
[FXT: perm/permrand-parity.h]:

    template <typename Type>
    void random_permute_parity(Type *f, ulong n, bool par)
    // Randomly permute the elements of f, such that the
    // parity of the permutation equals par.
    // I.e. the minimal number of transpositions of the
    // permutation is even if par==0, else odd.
    // Note: with n<=1 there is no odd permutation.
    {
        if ( (par==1) && (n<2) )  return;  // not admissible

        bool pr = 0;  // identity has even parity
        for (ulong k=1; k<n; ++k)
        {
            const ulong i = rand_idx(k+1);
            swap2(f[k], f[i]);
            pr ^= ( k != i );  // parity changes with swap
        }

        if ( par!=pr )  swap2(f[0], f[1]);  // need to change parity
    }


2.5.4  Random  permutation  with  m smallest  elements  in  prescribed  order 


In the last algorithm we conditionally changed the positions 0 and 1. Now we conditionally change the
elements 0 and 1 to preserve their relative order [FXT: perm/permrand-ord.h]:

    template <typename Type>
    void random_ord01_permutation(Type *f, ulong n)
    // Random permutation such that elements 0 and 1 are in order.
    {
        random_permutation(f, n);
        ulong t = 0;
        while ( f[t]>1 )  ++t;
        if ( f[t]==0 )  return;  // already in correct order
        f[t] = 0;
        do { ++t; } while ( f[t]!=0 );
        f[t] = 1;
    }


The routine generates half of all the permutations but not their reversals. The following routine fixes the
relative order of the m smallest elements:

    template <typename Type>
    void random_ordm_permutation(Type *f, ulong n, ulong m)
    // Random permutation such that the m smallest elements are in order.
    // Must have m<=n.
    {
        random_permutation(f, n);
        for (ulong t=0,j=0; j<m; ++t)  if ( f[t]<m )  { f[t]=j;  ++j; }
    }

A random permutation where 0 appears as the last of the m smallest elements is computed by:

    template <typename Type>
    void random_lastm_permutation(Type *f, ulong n, ulong m)
    // Random permutation such that 0 appears as last of the m smallest elements.
    // Must have m<=n.
    {
        random_permutation(f, n);
        if ( m<=1 )  return;

        ulong p0=0, p1=0;  // position of 0, and last (in m smallest elements)
        for (ulong t=0, j=0;  j<m;  ++t)
        {
            if ( f[t]<m )
            {
                p1 = t;  // update position of last
                if ( f[t]==0 )  { p0 = t; }  // record position of 0
                ++j;  // j out of m smallest found
            }
        }
        // here p1 is the position of the last of the m smallest elements
        swap2( f[p0], f[p1] );
    }


2.5.5  Random  permutation  with  prescribed  cycle  type 


To create a random permutation with given cycle type (see section 11.1.2 on page 278) we first give a
routine for permuting by one cycle of prescribed length. We need to keep track of the set of unprocessed
elements. The positions of those (available) elements are stored in an array r[]. After an element is
processed its index is swapped with the last available index [FXT: perm/permrand-cycle-type.h]:

    template <typename Type>
    inline ulong random_cycle(Type *f, ulong cl, ulong *r, ulong nr)
    // Permute a random set of elements (whose positions are given in
    // r[0], ..., r[nr-1]) by a random cycle of length cl.
    // Must have nr >= cl and cl != 0.
    {
        if ( cl==1 )  // just remove a random position from r[]
        {
            const ulong i = rand_idx(nr);
            --nr;  swap2( r[nr], r[i] );  // remove position from set
        }
        else  // cl >= 2
        {
            const ulong i0 = rand_idx(nr);
            const ulong k0 = r[i0];  // position of cycle leader
            const Type f0 = f[k0];  // cycle leader
            --cl;
            --nr;  swap2( r[nr], r[i0] );  // remove position from set

            ulong kp = k0;  // position of predecessor in cycle
            do  // create cycle
            {
                const ulong i = rand_idx(nr);
                const ulong k = r[i];  // random available position
                f[kp] = f[k];  // move element
                --nr;  swap2( r[nr], r[i] );  // remove position from set
                kp = k;  // update predecessor
            }
            while ( --cl );
            f[kp] = f0;  // close cycle
        }

        return nr;
    }


To  permute  according  to  a cycle  type,  we  call  the  routine  according  to  the  elements  of  an  array  c []  that 
specifies  how  many  cycles  of  each  length  are  required: 

    template <typename Type>
    inline void random_permute_cycle_type(Type *f, ulong n, const ulong *c, ulong *tr=0)
    // Permute the elements of f by a random permutation of prescribed cycle type.
    // The permutation will have c[k] cycles of length k+1.
    // Must have s <= n where s := sum(k=0, n-1, (k+1)*c[k]).
    // If s < n then the permutation will have n-s further fixed points.
    {
        ulong *r = tr;
        if ( tr==0 )  r = new ulong[n];
        for (ulong k=0; k<n; ++k)  r[k] = k;  // initialize set
        ulong nr = n;  // number of elements available
        // available positions are r[0], ..., r[nr-1]

        for (ulong k=0; k<n; ++k)
        {
            ulong nc = c[k];  // number of cycles of length k+1
            if ( nc==0 )  continue;  // no cycles of this length
            const ulong cl = k+1;  // cycle length
            do
            {
                nr = random_cycle(f, cl, r, nr);
            }
            while ( --nc );
        }

        if ( tr==0 )  delete [] r;
    }


2.5.6  Random  self-inverse  permutation 


For the self-inverse permutations (involutions) we need to compute certain branch probabilities. At each
step either a 2-cycle or a fixed point is generated. The probability that the next step generates a fixed
point is $R(n) = I(n-1)/I(n)$ where I(n) is the number of involutions of n elements. This can be seen
by dividing relation 11.1-6 on page 279 by I(n):

$$ 1 = \frac{I(n-1)}{I(n)} + \frac{(n-1)\,I(n-2)}{I(n)} \tag{2.5-1} $$

At each step we generate a random number t where $0 \le t < 1$; if $t > R(n)$ then a 2-cycle is created, else
a fixed point. The quantities I(n) cannot be used with fixed precision arithmetic because an overflow
would occur for large n. Instead, we update R(n) via

$$ R(n+1) = \frac{1}{1 + n\,R(n)} \tag{2.5-2} $$

The recurrence is numerically stable [FXT: perm/permrand-self-inverse.h]:

    inline void next_involution_branch_ratio(double &rat, double &n1)
    {
        n1 += 1.0;
        rat = 1.0/( 1.0 + n1*rat );
    }

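
The recurrence for R(n) can be compared against the exact ratios I(n-1)/I(n) for small n. The sketch below uses hypothetical helper names; the exact counts are computed with 64-bit integers via I(n) = I(n-1) + (n-1) I(n-2), which limits the check to moderate n.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Exact numbers of involutions I(0), ..., I(nmax), computed with
// I(n) = I(n-1) + (n-1)*I(n-2).  Restricted to nmax <= 25 to avoid overflow.
std::vector<unsigned long long> involution_counts(unsigned nmax)
{
    assert( nmax<=25 );
    std::vector<unsigned long long> I(nmax+1, 1);  // I(0) = I(1) = 1
    for (unsigned n=2; n<=nmax; ++n)  I[n] = I[n-1] + (n-1)*I[n-2];
    return I;
}

// Largest deviation between R(n) = I(n-1)/I(n) obtained from the recurrence
// R(n+1) = 1/(1 + n*R(n)) and from the exact counts, for n = 1, ..., nmax.
double max_ratio_error(unsigned nmax)
{
    const std::vector<unsigned long long> I = involution_counts(nmax);
    double rat = 1.0;  // R(1) = I(0)/I(1)
    double err = 0.0;
    for (unsigned n=1; n<=nmax; ++n)
    {
        const double exact = (double)I[n-1] / (double)I[n];
        err = std::fmax(err, std::fabs(rat - exact));
        rat = 1.0/( 1.0 + n*rat );  // now rat == R(n+1)
    }
    return err;
}
```

If the recurrence is stable as stated, the deviation should stay at the level of the floating-point rounding error.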


The following routine initializes the array of values R(n):

    inline void init_involution_branch_ratios(double *b, ulong n)
    {
        b[0] = 1.0;
        double rat = 0.5, n1 = 1.0;
        for (ulong k=1; k<n; ++k)
        {
            b[k] = rat;
            next_involution_branch_ratio(rat, n1);
        }
    }

    template <typename Type>
    inline void random_permute_self_inverse(Type *f, ulong n,
                                            ulong *tr=0, double *tb=0, bool bi=false)
    // Permute the elements of f by a random self-inverse permutation (an involution).
    // Set bi:=true to signal that the branch probabilities in tb[]
    // have been precomputed (via init_involution_branch_ratios()).
    {
        ulong *r = tr;
        if ( tr==0 )  r = new ulong[n];
        for (ulong k=0; k<n; ++k)  r[k] = k;
        ulong nr = n;  // number of elements available
        // available positions are r[0], ..., r[nr-1]

        double *b = tb;
        if ( tb==0 )  { b = new double[n];  bi=false; }
        if ( !bi )  init_involution_branch_ratios(b, n);

        while ( nr>=2 )
        {
            const ulong x1 = nr-1;
            const ulong r1 = r[x1];  // available position
            --nr;  // no swap needed if x1==last

            const double rat = b[nr];  // probability to choose fixed point

            const double t = rnd01();  // 0 <= t < 1
            if ( t > rat )  // 2-cycle
            {
                const ulong x2 = rand_idx(nr);
                const ulong r2 = r[x2];  // random available position != r1
                --nr;  swap2(r[x2], r[nr]);
                swap2( f[r1], f[r2] );
            }
            // else  // fixed point, nothing to do
        }

        if ( tr==0 )  delete [] r;
        if ( tb==0 )  delete [] b;
    }


The auxiliary function rnd01() returns a random number t where $0 \le t < 1$ [FXT: aux0/randf.cc].


2.5.7  Random  derangement 

In each step of the routine for a random permutation without fixed points (a derangement) we join two
cycles and decide whether to close the resulting cycle. The probability of closing is
$B(n) = (n-1)\,D(n-2)/D(n)$ where D(n) is the number of derangements of n elements. This can be seen
by dividing relation 11.1-12a on page 280 by D(n):

$$ 1 = \frac{(n-1)\,D(n-1)}{D(n)} + \frac{(n-1)\,D(n-2)}{D(n)} \tag{2.5-3} $$




The probability B(n) is close to 1/n for large n. Already for n > 30 the relative error (for B(n) versus
1/n) is less than $10^{-32}$, so B(n) is indistinguishable from 1/n with floating-point types where the mantissa
has at most 106 bits. We compute a table of just 32 values B(n) [FXT: perm/permrand-derange.h]:

    // number of precomputed branch ratios:
    #define NUM_PBR 32  // OK for up to 106-bit mantissa

    inline void init_derange_branch_ratios(double *b)
    {
        b[0] = 0.0;  b[1] = 1.0;
        double dn0 = 1.0, dn1 = 0.0, n1 = 1.0;
        for (ulong k=2; k<NUM_PBR; ++k)
        {
            const double dn2 = dn1;
            next_num_derangements(dn0, dn1, n1);
            const double rat = (n1) * dn2/dn0;  // == (n-1) * D(n-2) / D(n)
            b[k] = rat;
        }
    }

The D(n) are updated using $D(n) = (n-1)\,[D(n-1) + D(n-2)]$:

    inline void next_num_derangements(double &dn0, double &dn1, double &n1)
    {
        const double dn2 = dn1;  dn1 = dn0;  n1 += 1.0;
        dn0 = n1*(dn1 + dn2);
    }

Now the B(n) are computed as

    inline double derange_branch_ratio(const double *b, ulong n)
    {
        if ( n<NUM_PBR )  return b[n];
        else              return 1.0/(double)n;  // relative error < 1.0e-32
    }
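
How quickly B(n) approaches 1/n can be checked with exact integer arithmetic for small n. The following sketch uses hypothetical helper names and measures the deviation as |n B(n) - 1|; with 64-bit integers the derangement numbers are available only up to n = 20.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Exact numbers of derangements D(0), ..., D(nmax), computed with
// D(n) = (n-1)*(D(n-1) + D(n-2)).  Restricted to nmax <= 20 to avoid overflow.
std::vector<unsigned long long> derangement_counts(unsigned nmax)
{
    assert( nmax<=20 );
    std::vector<unsigned long long> D(nmax+1, 0);
    D[0] = 1;  // D(0) = 1, D(1) = 0
    for (unsigned n=2; n<=nmax; ++n)  D[n] = (n-1)*(D[n-1] + D[n-2]);
    return D;
}

// Relative deviation |n*B(n) - 1| of B(n) = (n-1)*D(n-2)/D(n) from 1/n, n >= 2.
double branch_ratio_rel_error(unsigned n)
{
    assert( n>=2 );
    const std::vector<unsigned long long> D = derangement_counts(n);
    const double B = (double)(n-1) * (double)D[n-2] / (double)D[n];
    return std::fabs( B*n - 1.0 );
}
```

For n = 4 we have B(4) = 3 D(2)/D(4) = 3/9, so the deviation is 1/3; it decreases roughly like 1/(n-2)! as n grows.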

The routine for a random derangement is

    template <typename Type>
    inline void random_derange(Type *f, ulong n,
                               ulong *tr=0,
                               double *tb=0, bool bi=false)
    // Permute the elements of f by a random derangement.
    // Set bi:=true to signal that the branch probabilities in tb[]
    // have been precomputed (via init_derange_branch_ratios()).
    // Must have n > 1.
    {
        ulong *r = tr;
        if ( tr==0 )  r = new ulong[n];
        for (ulong k=0; k<n; ++k)  r[k] = k;
        ulong nr = n;  // number of elements available
        // available positions are r[0], ..., r[nr-1]

        double *b = tb;
        if ( tb==0 )  { b = new double[NUM_PBR];  bi=false; }
        if ( !bi )  init_derange_branch_ratios(b);

        while ( nr>=2 )
        {
            const ulong x1 = nr-1;  // last element
            const ulong r1 = r[x1];

            const ulong x2 = rand_idx(nr-1);  // random element !=last
            const ulong r2 = r[x2];

            swap2( f[r1], f[r2] );  // join cycles containing f[r1] and f[r2]

            // remove r[x1]=r1 from set:
            --nr;  // swap2(r[x1], r[nr]);  // swap not needed if x1==last

            const double rat = derange_branch_ratio(b, nr);

            const double t = rnd01();  // 0 <= t < 1
            if ( t < rat )  // close cycle
            {
                // remove r[x2]=r2 from set:
                --nr;  swap2(r[x2], r[nr]);
            }
            // else cycle stays open
        }

        if ( tr==0 )  delete [] r;
        if ( tb==0 )  delete [] b;
    }

The method is (essentially) given in [245]. A generalization for permutations with all cycles of length
$\geq m$ is given in [21].


2.5.8  Random  connected  permutation 

A random connected (indecomposable) permutation can be computed via the rejection method: create
a random permutation, if it is not connected, repeat. An implementation is [FXT:
perm/permrand-connected.h]:

    inline void random_connected_permutation(ulong *f, ulong n)
    {
        for (ulong k=0; k<n; ++k)  f[k] = k;
        do { random_permute(f, n); } while ( ! is_connected(f, n) );
    }

The method is efficient because the number of connected permutations is (asymptotically) given by

$$ C(n) = n! \left( 1 - \frac{2}{n} + O\!\left(\frac{1}{n^2}\right) \right) \tag{2.5-4} $$

That is, the test for connectedness is expected to fail with a probability of about 2/n for large n. The
probability of failure can be reduced to about $2/n^2$ by avoiding the permutations that fix either the first
or the last element. The small cases ($n \le 3$) are treated separately:


    if ( n<=3 )
    {
        for (ulong k=0; k<n; ++k)  f[k] = k;
        if ( n<2 )  return;  // [] or [0]
        swap2(f[0], f[n-1]);
        if ( n==2 )  return;  // [1,0]
        // here: [2,1,0]
        const ulong i = rand_idx(3);
        swap2(f[1], f[i]);
        // i = 0 ==> [1,2,0]
        // i = 1 ==> [2,1,0]
        // i = 2 ==> [2,0,1]
        return;
    }

    do
    {
        for (ulong k=0; k<n; ++k)  f[k] = k;

        while ( 1 )
        {
            const ulong i0 = 1 + rand_idx(n-1);  // first element must move
            const ulong i1 = 1 + rand_idx(n-1);  // f[1] will be last element
            swap2( f[0], f[i0] );
            swap2( f[1], f[i1] );
            if ( f[1]==n-1 )  // undo swap
            {
                swap2( f[1], f[i1] );
                swap2( f[0], f[i0] );
                continue;  // probability O(1/n)
            }
            else  break;
        }

        // here: f[0] != 0 and f[1] != n-1
        swap2(f[1], f[n-1]);  // move f[1] to last
        // here: f[0] != 0 and f[n-1] != n-1
        random_permute(f+1, n-2);  // permute 2nd ... 2nd last element
    }
    while ( ! is_connected(f, n) );


2.6  The  revbin  permutation 


Figure 2.6-A: Permutation matrices of the revbin permutation for sizes 16, 8 and 4. The permutation
is self-inverse.


The permutation that swaps elements whose binary indices are mutual reversals is called revbin permutation
(sometimes also bit-reversal or bitrev permutation). For example, for length n = 256 the element
with index x = 43_10 = 00101011_2 is swapped with the element whose index is $\tilde{x}$ = 11010100_2 = 212_10.
Note that $\tilde{x}$ depends on both x and n. Pseudocode for a naive implementation is

procedure revbin_permute(a[], n)
// a[0..n-1] input, result
{
    for x:=0 to n-1
    {
        r := revbin(x, n)
        if r>x then swap(a[x], a[r])
    }
}

The condition r>x before the swap() statement makes sure that the swapping is not undone later when
the loop variable x has the value of the present r.
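The naive pseudocode translates directly into C++. The following is a minimal sketch under stated assumptions: the helper revbin() here takes the number of bits ldn (not n) and reverses bit by bit, and the name revbin_permute_naive is chosen for illustration only; neither is taken from the FXT library.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ulong = unsigned long;

// Reverse the lowest ldn bits of x (bit by bit; simple rather than fast).
inline ulong revbin(ulong x, ulong ldn)
{
    ulong r = 0;
    for (ulong i = 0; i < ldn; ++i)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Naive revbin permutation of a[0..n-1]; n = a.size() must be a power of 2.
template <typename Type>
void revbin_permute_naive(std::vector<Type> &a)
{
    const ulong n = a.size();
    ulong ldn = 0;
    while ( (1UL << ldn) < n )  ++ldn;  // ldn = log2(n)
    for (ulong x = 0; x < n; ++x)
    {
        const ulong r = revbin(x, ldn);
        if ( r > x )  std::swap(a[x], a[r]);  // swap each pair only once
    }
}
```

Applying the routine twice restores the original order, which illustrates that the permutation is self-inverse.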


2.6.1  Computation  using  revbin-update 

The key ingredient for a fast permutation routine is the observation that we only need to update the
bit-reversed values: given $\tilde{x}$ we can compute the reversal of x+1 efficiently as described in section 1.14.3 on page 36.
A faster routine will be of the form

procedure revbin_permute(a[], n)
// a[0..n-1] input, result
{
    if n<=2 return
    r := 0  // the reversed 0
    for x:=1 to n-1
    {
        r := revbin_upd(r, n/2)
        if r>x then swap(a[x], a[r])
    }
}

About (n - sqrt(n))/2 swap() statements are executed with the revbin permutation of n elements. That is,
almost every element is moved for large n, as there are only a few numbers with symmetric bit patterns:

       n      2 * # swaps     # symm. pairs
       2          0                2
       4          2                2
       8          4                4
      16         12                4
      32         24                8
      64         56                8
    2^10        992               32
    2^20    0.999 * 2^20         2^10
    infinity  n - sqrt(n)       sqrt(n)

The sequence is entry A045687 in [312]:

0, 2, 4, 12, 24, 56, 112, 240, 480, 992, 1984, 4032, 8064, 16256, 32512, 65280, ...


2.6.2  Exploiting  the  symmetries  of  the  permutation 


Symmetry can be used for further optimization: if for even x < n/2 there is a swap for the pair (x, $\tilde{x}$),
then there is also a swap for the pair (n-1-x, n-1-$\tilde{x}$). As x < n/2 and $\tilde{x}$ < n/2, one has n-1-x > n/2
and n-1-$\tilde{x}$ > n/2. That is, the swaps are independent. A routine that uses these observations is

procedure revbin_permute(a[], n)
{
    if n<=2 return
    nh := n/2
    r := 0  // the reversed 0
    x := 1
    while x<nh
    {
        // x odd:
        r := r + nh
        swap(a[x], a[r])
        x := x + 1

        // x even:
        r := revbin_upd(r, n/2)
        if r>x then
        {
            swap(a[x], a[r])
            swap(a[n-1-x], a[n-1-r])
        }
        x := x + 1
    }
}
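The symmetry-exploiting pseudocode can be checked against a direct bit reversal. The sketch below is an illustration only: revbin_upd() is written here as a plain loop with a guard against h reaching zero (the FXT version is described in section 1.14.3), and the name revbin_permute_symm is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ulong = unsigned long;

// With r == reversal of x at entry and h == n/2, return the reversal of x+1.
// Incrementing x flips its trailing ones and the first zero; in the reversed
// word the same bits are flipped starting from the top.
inline ulong revbin_upd(ulong r, ulong h)
{
    while ( h != 0 )
    {
        r ^= h;               // flip current bit
        if ( r & h )  break;  // bit became 1: the carry stops here
        h >>= 1;
    }
    return r;
}

// Revbin permutation using the symmetry (x, r) <--> (n-1-x, n-1-r);
// n = a.size() must be a power of 2.
template <typename Type>
void revbin_permute_symm(std::vector<Type> &a)
{
    const ulong n = a.size();
    if ( n <= 2 )  return;
    const ulong nh = n / 2;
    ulong r = 0;  // the reversed 0
    ulong x = 1;
    while ( x < nh )
    {
        // x odd: r holds the reversal of x-1 (< n/2), so add the top bit
        r += nh;
        std::swap(a[x], a[r]);
        ++x;

        // x even: both x and r are < n/2, the mirrored pair is > n/2
        r = revbin_upd(r, nh);
        if ( r > x )
        {
            std::swap(a[x], a[r]);
            std::swap(a[n-1-x], a[n-1-r]);
        }
        ++x;
    }
}
```

For n = 16 the routine performs the six swaps (1,8), (2,4), (13,11), (3,12), (5,10), (7,14), the same pairs as the unrolled routine for that size.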

The code above can be used to derive an optimized version for zero padded data (used with linear
convolution, see section 22.1.4 on page 443):

procedure revbin_permute0(a[], n)
{
    if n<=2 return
    nh := n/2
    r := 0  // the reversed 0
    x := 1
    while x<nh
    {
        // x odd:
        r := r + nh
        a[r] := a[x]
        a[x] := 0
        x := x + 1

        // x even:
        r := revbin_upd(r, n/2)
        if r>x then swap(a[x], a[r])
        // Omit swap of a[n-1-x] and a[n-1-r] as both are zero
        x := x + 1
    }
}

We can carry the scheme further, distinguishing whether x mod 4 = 0, 1, 2, or 3, as done in the implementation
[FXT: perm/revbinpermute.h]. The following parameters determine how much of the symmetry
is used and which version of the revbin-update routine is chosen:

#define RBP_SYMM  4  // amount of symmetry used: 1, 2, 4 (default is 4)
#define FAST_REVBIN  // define if using revbin(x, ldn) is faster than updating

We further define a macro to swap elements:

#define idx_swap(k, r)  { ulong kx=(k), rx=(r);  swap2(f[kx], f[rx]); }


The main routine uses unrolled versions of the revbin permutation for small values of n. These are given
in [FXT: perm/shortrevbinpermute.h]. For example, the unrolled routine for n = 16 is

template <typename Type>
inline void revbin_permute_16(Type *f)
{
    swap2(f[1], f[8]);
    swap2(f[2], f[4]);
    swap2(f[3], f[12]);
    swap2(f[5], f[10]);
    swap2(f[7], f[14]);
    swap2(f[11], f[13]);
}

The code was generated with the program [FXT: perm/cycles-demo.cc], see section 2.2 on page 104. The
routine revbin_permute_leq_64(f,n), which is called for n <= 64, selects the correct routine for the
parameter n:

template <typename Type>
void revbin_permute(Type *f, ulong n)
{
    if ( n<=64 )
    {
        revbin_permute_leq_64(f, n);
        return;
    }
    [--snip--]

In what follows we set RBP_SYMM to 4, define FAST_REVBIN, and omit the corresponding preprocessor
statements. Some auxiliary constants have to be computed (the bit patterns in the comments are for n = 256):

    const ulong ldn = ld(n);
    const ulong nh = (n>>1);
    const ulong n1 = n - 1;      // = 11111111
    const ulong nx1 = nh - 2;    // = 01111110
    const ulong nx2 = n1 - nx1;  // = 10000001

The main loop is

    ulong k = 0,  r = 0;
    while ( k < (n/RBP_SYMM) )  // n>=16, n/2>=8
    {
        // k%4 == 0:
        if ( r>k )
        {
            idx_swap(k, r);          // <nh, <nh 11
            idx_swap(n1^k, n1^r);    // >nh, >nh 00
            idx_swap(nx1^k, nx1^r);  // <nh, <nh
            idx_swap(nx2^k, nx2^r);  // >nh, >nh
        }

        ++k;
        r ^= nh;

        // k%4 == 1:
        if ( r>k )
        {
            idx_swap(k, r);        // <nh, >nh 10
            idx_swap(n1^k, n1^r);  // >nh, <nh 01
        }

        ++k;
        r = revbin(k, ldn);

        // k%4 == 2:
        if ( r>k )
        {
            idx_swap(k, r);        // <nh, <nh 11
            idx_swap(n1^k, n1^r);  // >nh, >nh 00
        }

        ++k;
        r ^= nh;

        // k%4 == 3:
        if ( r>k )
        {
            idx_swap(k, r);          // <nh, >nh 10
            idx_swap(nx1^k, nx1^r);  // <nh, >nh 10
        }

        ++k;
        r = revbin(k, ldn);
    }
}  // end of the routine

For large n the routine takes about six times longer than a simple array reversal. Much of the time
is spent waiting for memory, which suggests that further optimizations would best be attempted with
special machine instructions to bypass the cache or with non-temporal writes.

A specialized implementation optimized for zero padded data is given in [FXT: perm/revbinpermute0.h].
Some memory accesses can be avoided for that case. For example, revbin-pairs with both indices greater
than n/2 need no processing at all.

2.6.3  A pitfall 


When working with separate arrays for the real and imaginary parts of complex data, one could remove
half of the bookkeeping as follows:

procedure revbin_permute(a[], b[], n)
{
    if n<=2 return
    r := 0  // the reversed 0
    for x:=1 to n-1
    {
        r := revbin_upd(r, n/2)  // inline me
        if r>x then
        {
            swap(a[x], a[r])
            swap(b[x], b[r])
        }
    }
}

If both the real and the imaginary part fit into level-1 cache the method can lead to a speedup. However,
for large arrays the routine can be much slower than two separate calls of the simple method: with FFTs
the real and imaginary element for the same index typically lie apart in memory by a power of 2, leading
to a high percentage of cache misses with large arrays.


2.7  The  radix  permutation 

The  radix  permutation  is  the  generalization  of  the  revbin  permutation  to  arbitrary  radices.  Pairs  of 
elements  are  swapped  when  their  indices,  written  in  radix  r,  are  reversed.  For  example,  in  radix  10  and 
n = 1000  the  elements  with  indices  123  and  321  will  be  swapped.  The  radix  permutation  is  self-inverse. 

Code for the radix r permutation of the array f[] is given in [FXT: perm/radixpermute.h]. The routine
must be called with n a perfect power of the radix r. Radix r = 2 gives the revbin permutation.

extern ulong radix_permute_nt[];  // == 9, 90, 900, ... for r=10
extern ulong radix_permute_kt[];  // == 1, 10, 100, ... for r=10
#define NT  radix_permute_nt
#define KT  radix_permute_kt

template <typename Type>
void radix_permute(Type *f, ulong n, ulong r)
{
    ulong x = 0;
    NT[0] = r-1;
    KT[0] = 1;
    while ( 1 )
    {
        ulong z = KT[x] * r;
        if ( z>n )  break;
        ++x;
        KT[x] = z;
        NT[x] = NT[x-1] * r;
    }
    // here: n == r**x

    for (ulong i=0, j=0;  i < n-1;  i++)
    {
        if ( i<j )  swap2(f[i], f[j]);

        ulong t = x - 1;
        ulong k = NT[t];  // =^= k = (r-1) * n / r;

        while ( k<=j )
        {
            j -= k;
            k = NT[--t];  // =^= k /= r;
        }

        j += KT[t];  // =^= j += (k/(r-1));
    }
}
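The effect of the radix permutation can be illustrated with a much simpler (and slower) routine that reverses the radix-r digits of every index explicitly. This is a sketch for illustration, not the table-driven FXT routine above; the names radix_reverse and radix_permute_naive are hypothetical, and the array length is assumed to be a perfect power of r with r >= 2.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ulong = unsigned long;

// Reverse the radix-r digits of x, where x is written with exactly nd digits.
inline ulong radix_reverse(ulong x, ulong r, ulong nd)
{
    ulong y = 0;
    for (ulong d = 0; d < nd; ++d)  { y = y * r + x % r;  x /= r; }
    return y;
}

// Naive radix permutation; f.size() must equal r**nd for some nd, r >= 2.
template <typename Type>
void radix_permute_naive(std::vector<Type> &f, ulong r)
{
    const ulong n = f.size();
    ulong nd = 0;                            // number of radix-r digits
    for (ulong p = 1; p < n; p *= r)  ++nd;  // n == r**nd assumed
    for (ulong i = 0; i < n; ++i)
    {
        const ulong j = radix_reverse(i, r, nd);
        if ( i < j )  std::swap(f[i], f[j]);  // swap each pair only once
    }
}
```

With r = 10 and n = 1000 the elements at indices 123 and 321 change places, as in the example given in the text; a second call restores the original order.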

2.8  In-place  matrix  transposition 


Transposing a matrix is easy when it is not done in-place. The following routine does the job [FXT:
aux2/transpose.h]:

template <typename Type>
void transpose(const Type * restrict f, Type * restrict g, ulong nr, ulong nc)
// Transpose nr x nc matrix f[] into an nc x nr matrix g[].
{
    for (ulong r=0; r<nr; r++)
    {
        ulong isrc = r * nc;
        ulong idst = r;
        for (ulong c=0; c<nc; c++)
        {
            g[idst] = f[isrc];
            isrc += 1;
            idst += nr;
        }
    }
}

Matters get more complicated for the in-place equivalent. We have to find the cycles (see section 2.2 on
page 104) of the underlying permutation. To transpose an n_r x n_c matrix first identify the position i of
the entry in row r and column c:

$i = r \cdot n_c + c$    (2.8-1)

After the transposition the element will be at position i' in the transposed n'_r x n'_c matrix

$i' = r' \cdot n'_c + c'$    (2.8-2)

We have r' = c, c' = r, n'_r = n_c and n'_c = n_r, so

$i' = c \cdot n_r + r$    (2.8-3)

Multiplying the last equation by n_c gives

$i' \cdot n_c = c \cdot n_r \cdot n_c + r \cdot n_c$    (2.8-4)

With n := n_r n_c and r n_c = i - c we find

$i' \cdot n_c = c \cdot n + i - c$    (2.8-5)

$i = i' \cdot n_c - c \cdot (n - 1)$    (2.8-6)

Take the equation modulo n-1 to obtain

$i \equiv i' \cdot n_c \pmod{n-1}$    (2.8-7)

That is, the transposition moves the element i = i' n_c to position i'. Multiply by n_r to find the inverse:

$i \cdot n_r \equiv i' \cdot n_c \cdot n_r \equiv i' \cdot (n - 1 + 1) \equiv i' \pmod{n-1}$    (2.8-8)

That is, element i will be moved to i' = i n_r mod n-1. The following routine uses a bit-array to keep
track of the elements processed so far [FXT: aux2/transpose.h]:

#define SRC(k)  (((unsigned long long)(k)*nc)%n1)

template <typename Type>
void transpose(Type *f, ulong nr, ulong nc, bitarray *ba=0)
// In-place transposition of an nr X nc array
// that lies in contiguous memory.
{
    if ( 1>=nr )  return;
    if ( 1>=nc )  return;

    if ( nr==nc )  transpose_square(f, nr);
    else
    {
        const ulong n1 = nr * nc - 1;

        bitarray *tba = 0;
        if ( 0==ba )  tba = new bitarray(n1);
        else          tba = ba;
        tba->clear_all();

        for (ulong k=1; k<n1; k=tba->next_clear(++k) )  // 0 and n1 are fixed points
        {
            // do a cycle:
            ulong ks = SRC(k);
            ulong kd = k;
            tba->set(kd);
            Type t = f[kd];
            while ( ks != k )
            {
                f[kd] = f[ks];
                kd = ks;
                tba->set(kd);
                ks = SRC(ks);
            }
            f[kd] = t;
        }

        if ( 0==ba )  delete tba;
    }
}

One should take care of possible overflows in the calculation of i n_c. For the case that n is a power of
2 (and so are both n_r and n_c) the multiplications modulo n-1 are cyclic shifts. Thus any overflow
can be avoided and the computation is also significantly cheaper. An implementation is given in [FXT:
aux2/transpose2.h].
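The cycle-following idea can be sketched without the FXT bitarray class. In the following illustration std::vector<bool> takes the role of the bit-array and the function name transpose_inplace is hypothetical; the source index for position k is k * nc mod (n-1), as in relation (2.8-7).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place transposition of an nr x nc matrix stored row by row in f.
// Position k receives the element from position (k * nc) mod (nr*nc - 1).
template <typename Type>
void transpose_inplace(std::vector<Type> &f, std::size_t nr, std::size_t nc)
{
    if ( nr <= 1 || nc <= 1 )  return;  // memory layout does not change
    const std::size_t n1 = nr * nc - 1;
    auto src = [nc, n1](std::size_t k)
    {   // widen before multiplying to reduce the risk of overflow
        return static_cast<std::size_t>(
            (static_cast<unsigned long long>(k) * nc) % n1 );
    };

    std::vector<bool> done(n1, false);  // marks positions already processed
    for (std::size_t k = 1; k < n1; ++k)  // 0 and n1 are fixed points
    {
        if ( done[k] )  continue;
        // do a cycle:
        Type t = f[k];
        std::size_t kd = k;
        std::size_t ks = src(kd);
        done[kd] = true;
        while ( ks != k )
        {
            f[kd] = f[ks];
            kd = ks;
            done[kd] = true;
            ks = src(kd);
        }
        f[kd] = t;
    }
}
```

For the 2 x 3 matrix stored as [0 1 2 3 4 5] the routine produces [0 3 1 4 2 5], the 3 x 2 transposed matrix in row-major order.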


2.9  Rotation  by  triple  reversal 

To rotate a length-n array by s positions without using any temporary memory, reverse three times as
in the following routine [FXT: perm/rotate.h]:

template <typename Type>
void rotate_left(Type *f, ulong n, ulong s)
// Rotate towards element #0
// Shift is taken modulo n
{
    if ( s>=n )
    {
        if ( n<2 )  return;
        s %= n;
    }
    if ( s==0 )  return;

    reverse(f,   s);
    reverse(f+s, n-s);
    reverse(f,   n);
}

Rotate left by 3 positions:
  [ 1 2 3 4 5 6 7 8 ]  original array
  [ 3 2 1 4 5 6 7 8 ]  reverse first 3 elements
  [ 3 2 1 8 7 6 5 4 ]  reverse last 8-3=5 elements
  [ 4 5 6 7 8 1 2 3 ]  reverse whole array

Rotate right by 3 positions:
  [ 1 2 3 4 5 6 7 8 ]  original array
  [ 5 4 3 2 1 6 7 8 ]  reverse first 8-3=5 elements
  [ 5 4 3 2 1 8 7 6 ]  reverse last 3 elements
  [ 6 7 8 1 2 3 4 5 ]  reverse whole array

Figure 2.9-A: Rotation of a length-8 array by 3 positions to the left (top) and right (bottom).

We will call this trick the triple reversal technique. For example, left-rotating an 8-element array by
3 positions is achieved by the steps shown in figure 2.9-A (top). A right rotation of an n-element array
by s positions is identical to a left rotation by n-s positions (bottom of figure 2.9-A):


template <typename Type>
void rotate_right(Type *f, ulong n, ulong s)
// Rotate away from element #0
// Shift is taken modulo n
{
    if ( s>=n )
    {
        if ( n<2 )  return;
        s %= n;
    }
    if ( s==0 )  return;

    reverse(f,     n-s);
    reverse(f+n-s, s);
    reverse(f,     n);
}
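The triple reversal technique can be tried with the standard library alone. The following sketch uses std::reverse on a std::vector in place of the FXT reverse() routine; the names rotate_left3 and rotate_right3 are chosen for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rotate f towards element #0 by s positions (shift taken modulo n).
template <typename Type>
void rotate_left3(std::vector<Type> &f, std::size_t s)
{
    const std::size_t n = f.size();
    if ( n < 2 )  return;
    s %= n;
    if ( s == 0 )  return;
    std::reverse(f.begin(), f.begin() + s);  // reverse first s elements
    std::reverse(f.begin() + s, f.end());    // reverse last n-s elements
    std::reverse(f.begin(), f.end());        // reverse whole array
}

// Rotate f away from element #0: same as a left rotation by n-s positions.
template <typename Type>
void rotate_right3(std::vector<Type> &f, std::size_t s)
{
    const std::size_t n = f.size();
    if ( n < 2 )  return;
    s %= n;
    if ( s == 0 )  return;
    rotate_left3(f, n - s);
}
```

Applied to [1 2 3 4 5 6 7 8] with s = 3 the two routines reproduce the final rows of figure 2.9-A. In production code std::rotate does the same job; the sketch only makes the three reversals visible.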

We could also execute the (self-inverse) steps of the left-shift routine in reversed order:

    reverse(f,   n);
    reverse(f+s, n-s);
    reverse(f,   s);

Figure 2.9-B: Swapping the blocks [a b c d e] and [w x y z] via 4 reversals.

The triple reversal trick can also be used to swap two blocks in an array: first reverse the three ranges (first
block, range between blocks, last block), then reverse the range that consists of all three. We will call this
trick the quadruple reversal technique. The corresponding code is given in [FXT: perm/swapblocks.h]:

template <typename Type>
void swap_blocks(Type *f, ulong x1, ulong n1, ulong x2, ulong n2)
// Swap the blocks starting at indices x1 and x2
// n1 and n2 are the block lengths
{
    if ( x1>x2 )  { swap2(x1,x2);  swap2(n1,n2); }
    f += x1;
    x2 -= x1;
    ulong n = x2 + n2;
    reverse(f, n1);
    reverse(f+n1, n-n1-n2);
    reverse(f+x2, n2);
    reverse(f, n);
}

The elements before x1 and after x2+n2 are not accessed. An example is shown in figure 2.9-B. The
listing was created with the program [FXT: perm/swap-blocks-demo.cc].

A routine to undo the effect of swap_blocks(f, x1, n1, x2, n2) can be obtained by reversing the
order of the steps:

template <typename Type>
void inverse_swap_blocks(Type *f, ulong x1, ulong n1, ulong x2, ulong n2)
{
    if ( x1>x2 )  { swap2(x1,x2);  swap2(n1,n2); }
    f += x1;
    x2 -= x1;
    ulong n = x2 + n2;
    reverse(f, n);
    reverse(f+x2, n2);
    reverse(f+n1, n-n1-n2);
    reverse(f, n1);
}

An alternative method is to call swap_blocks(f, x1, n2, x2+n2-n1, n1).
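The quadruple reversal technique, and the alternative way of undoing it, can be checked with a small sketch based on std::reverse. The function name swap_blocks_sketch and the iterator-based formulation are assumptions made for this illustration; the blocks are assumed not to overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Swap the block of length n1 starting at x1 with the block of length n2
// starting at x2 (blocks must not overlap).
template <typename Type>
void swap_blocks_sketch(std::vector<Type> &f,
                        std::size_t x1, std::size_t n1,
                        std::size_t x2, std::size_t n2)
{
    if ( x1 > x2 )  { std::swap(x1, x2);  std::swap(n1, n2); }
    auto b1 = f.begin() + x1;  // start of first block
    auto e1 = b1 + n1;         // end of first block
    auto b2 = f.begin() + x2;  // start of last block
    auto e2 = b2 + n2;         // end of last block
    std::reverse(b1, e1);      // first block
    std::reverse(e1, b2);      // range between blocks
    std::reverse(b2, e2);      // last block
    std::reverse(b1, e2);      // whole range
}
```

Swapping [a b c d e] and [w x y z] in "-abcde12wxyz-" gives "-wxyz12abcde-"; calling the routine again with the arguments (x1, n2, x2+n2-n1, n1) restores the original string.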


2.10  The  zip  permutation 


Figure  2.10-A  : Permutation  matrices  of  the  zip  permutation  (left)  and  its  inverse  (right). 


The zip permutation moves the elements from the lower half to the even indices and the elements from
the upper half to the odd indices. Symbolically,

  [ a b c d A B C D ]  |-->  [ a A b B c C d D ]

The size of the array must be even. A routine for the permutation is [FXT: perm/zip.h]:

template <typename Type>
void zip(const Type * restrict f, Type * restrict g, ulong n)
{
    ulong nh = n/2;
    for (ulong k=0,  k2=0;  k<nh;  ++k, k2+=2)  g[k2] = f[k];
    for (ulong k=nh, k2=1;  k<n;   ++k, k2+=2)  g[k2] = f[k];
}

The inverse of the zip permutation is the unzip permutation. It moves the even indices to the lower half
and the odd indices to the upper half:

template <typename Type>
void unzip(const Type * restrict f, Type * restrict g, ulong n)
{
    ulong nh = n/2;
    for (ulong k=0,  k2=0;  k<nh;  ++k, k2+=2)  g[k] = f[k2];
    for (ulong k=nh, k2=1;  k<n;   ++k, k2+=2)  g[k] = f[k2];
}
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Figure  2.10-B:  Revbin  permutation  matrices  that,  when  multiplied  together,  give  the  zip  permutation 
and  its  inverse.  Let  L and  R be  the  permutations  given  on  the  left  and  right  side,  respectively.  Then 
Z = R L and Z^{-1} = L R.


If the array size n is a power of 2, we can compute the zip permutation as a transposition of a 2 x n/2
matrix:

template <typename Type>
void zip(Type *f, ulong n)
{
    ulong nh = n/2;
    revbin_permute(f, nh);  revbin_permute(f+nh, nh);
    revbin_permute(f, n);
}

The in-place version for the unzip permutation for arrays whose size is a power of 2 is

template <typename Type>
void unzip(Type *f, ulong n)
{
    ulong nh = n/2;
    revbin_permute(f, n);
    revbin_permute(f, nh);  revbin_permute(f+nh, nh);
}
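That three revbin permutations give the zip permutation can be verified with a short sketch. The revbin_permute() used below is a naive stand-in (an assumption for this illustration, not the optimized FXT routine), and the names zip_inplace and unzip_inplace are hypothetical. In terms of index bits, the two half-size reversals followed by the full reversal amount to a rotation of the index word by one bit, which is what the zip permutation does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

using ulong = unsigned long;

// Reverse the lowest ldn bits of x.
inline ulong revbin(ulong x, ulong ldn)
{
    ulong r = 0;
    for (ulong i = 0; i < ldn; ++i)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Naive revbin permutation of f[0..n-1], n a power of 2.
template <typename Type>
void revbin_permute(Type *f, ulong n)
{
    ulong ldn = 0;
    while ( (1UL << ldn) < n )  ++ldn;
    for (ulong x = 0; x < n; ++x)
    {
        const ulong r = revbin(x, ldn);
        if ( r > x )  std::swap(f[x], f[r]);
    }
}

// In-place zip for n a power of 2: revbin both halves, then the whole array.
template <typename Type>
void zip_inplace(Type *f, ulong n)
{
    const ulong nh = n / 2;
    revbin_permute(f, nh);  revbin_permute(f + nh, nh);
    revbin_permute(f, n);
}

// In-place unzip: the same steps in reversed order.
template <typename Type>
void unzip_inplace(Type *f, ulong n)
{
    const ulong nh = n / 2;
    revbin_permute(f, n);
    revbin_permute(f, nh);  revbin_permute(f + nh, nh);
}
```

Applied to the characters of "abcdABCD", zip_inplace() gives "aAbBcCdD" and unzip_inplace() restores the input.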


If  the  type  Complex  consists  of  two  doubles  lying  contiguous  in  memory,  then  we  can  optimize  the 
procedures  as  follows: 


void zip(double *f, long n)
{
    revbin_permute(f, n);
    revbin_permute((Complex *)f, n/2);
}

void unzip(double *f, long n)
{
    revbin_permute((Complex *)f, n/2);
    revbin_permute(f, n);
}
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For  arrays  whose  size  n is  not  a power  of  2 the  in-place  zip  permutation  can  be  computed  by  transposing 
the  data  as  a 2 x n/2  matrix: 


    transpose(f, 2, n/2);  // =^= zip(f, n)

The routines for in-place transposition are given in section 2.8 on page 122. The inverse is computed by
transposing the data as an n/2 x 2 matrix:

    transpose(f, n/2, 2);  // =^= unzip(f, n)

While the above mentioned technique is usually not a gain for doing a transposition, it may be used to
speed up the revbin permutation itself.


2.11  The  XOR  permutation 
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Figure  2.11-A:  Permutation  matrices  of  the  XOR  permutation  for  length  8 with  parameter  x = 0 ...  7. 
Compare to the table for the dyadic convolution shown in figure 23.8-A on page 481.


The XOR permutation (with parameter x) swaps the element at index k with the element at index
x XOR k (see figure 2.11-A). The implementation is easy [FXT: perm/xorpermute.h]:

template <typename Type>
void xor_permute(Type *f, ulong n, ulong x)
{
    if ( 0==x )  return;
    for (ulong k=0; k<n; ++k)
    {
        ulong r = k^x;
        if ( r>k )  swap2(f[r], f[k]);
    }
}


The  XOR  permutation  is  clearly  self-inverse.  The  array  length  n must  be  divisible  by  the  smallest  power 
of  2 that  is  greater  than  x.  For  example,  n must  be  even  if  x = 1 and  n must  be  divisible  by  4 if  x = 2 
or  x = 3.  With  n a power  of  2 and  x < n one  is  on  the  safe  side. 

The  XOR  permutation  contains  a few  other  permutations  as  important  special  cases  (for  simplicity 
assume  that  the  array  length  n is  a power  of  2):  If  the  third  argument  x equals  n — 1,  the  permutation 
is  the  reversal.  With  x = 1 neighboring  even  and  odd  indexed  elements  are  swapped.  With  x = n/2  the 
upper  and  the  lower  half  of  the  array  are  swapped. 
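The special cases can be checked with a few lines of code. The sketch below is an illustration that uses std::vector and std::swap instead of the FXT routine; the name xor_permute_sketch is hypothetical, and the array length is assumed to be a power of 2 with x < n.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ulong = unsigned long;

// XOR permutation: swap f[k] with f[k XOR x] for all k.
// Assumes f.size() is a power of 2 and x < f.size().
template <typename Type>
void xor_permute_sketch(std::vector<Type> &f, ulong x)
{
    if ( x == 0 )  return;
    const ulong n = f.size();
    for (ulong k = 0; k < n; ++k)
    {
        const ulong r = k ^ x;
        if ( r > k )  std::swap(f[r], f[k]);  // swap each pair only once
    }
}
```

For n = 8 the parameter x = 7 reverses the array, x = 1 swaps neighboring elements, and x = 4 swaps the two halves. Applying the permutation with parameter a and then with parameter b has the same effect as a single call with a XOR b.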

We have

$X_a X_b = X_b X_a = X_c$ where c = a XOR b    (2.11-1)

For the special case a = b the relation does express the self-inverse property as X_0 is the identity. The
XOR permutation occurs in relations between other permutations where we will use the symbol X_a, the
subscript a denoting the third argument in the given routine.

2.12  The  Gray  permutation

Figure 2.12-A: Permutation matrices of the Gray permutation (left) and its inverse (right).

The Gray permutation reorders (length-2^n) arrays according to the binary Gray code described in section
1.16 on page 41. A routine for the permutation is [FXT: perm/graypermute.h]:

template <typename Type>
inline void gray_permute(const Type *f, Type * restrict g, ulong n)
// Put Gray permutation of f[] to g[], i.e. g[gray_code(k)] == f[k]
{
    for (ulong k=0; k<n; ++k)  g[gray_code(k)] = f[k];
}

Its inverse is

template <typename Type>
inline void inverse_gray_permute(const Type *f, Type * restrict g, ulong n)
// Put inverse Gray permutation of f[] to g[], i.e. g[k] == f[gray_code(k)]
// (same as: g[inverse_gray_code(k)] == f[k])
{
    for (ulong k=0; k<n; ++k)  g[k] = f[gray_code(k)];
}

We again use calls to the routine to compute the Gray code because they are cheaper than the computations
of the inverse Gray code.


2.12.1  Cycles  of  the  permutation 


We  want  to  create  in-place  versions  of  the  Gray  permutation  routines.  It  is  necessary  to  identify  the  cycle 
leaders  of  the  permutation  (see  section  2.2  on  page  104)  and  find  an  efficient  way  to  generate  them. 


It is instructive to study the complementary masks that occur for cycles of different lengths. The cycles
of the Gray permutation for length 128 are shown in figure 2.12-B. No structure is immediately
visible. However, we can generate the cycle maxima as follows: for each range 2^k ... 2^(k+1)-1 generate
a bit-mask z that consists of the k+1 leftmost bits of the infinite word that has ones at positions
0, 1, 2, 4, 8, ..., 2^i, ...:

  [111010001000000010000000000000001000 ... ]

An example: for k = 6 we have z = [1110100]. Then take v to be the k+1 leftmost bits of the complement,
v = [0001011] in our example. Now the set of words c = z + s where s is a subset of v contains exactly
one element of each cycle in the range 2^k ... 2^(k+1)-1 = 64 ... 127, indeed the maximum of the cycle:
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Figure  2.12-B:  Cycles  of  the  Gray  permutation  of  length  128. 
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The  sequence  of  cycle  maxima  is  entry  A175339  in  [312],  The  minima  (entry  A175338)  of  the  cycles  can 
be  computed  similarly: 


  1......  = 64
  1.....1  = 65
  1....1.  = 66
  1....11  = 67
  1..1...  = 72
  1..1..1  = 73
  1..1.1.  = 74
  1..1.11  = 75
  minima = z XOR subsets(v) where z = 1...... and v = ...1.11

The list can be generated with the program [FXT: perm/permgray-leaders-demo.cc] which uses the
routine [FXT: class gray_cycle_leaders in comb/gray-cycle-leaders.h]:

class gray_cycle_leaders
// Generate cycle leaders for Gray permutation
// where highest bit is at position ldn.
{
public:
    bit_subset b_;
    ulong za_;   // mask for cycle maxima
    ulong zi_;   // mask for cycle minima
    ulong len_;  // cycle length
    ulong num_;  // number of cycles

public:
    gray_cycle_leaders(ulong ldn)  // 0<=ldn<BITS_PER_LONG
        : b_(0)
    { init(ldn); }

    ~gray_cycle_leaders()  {;}

    void init(ulong ldn)
    {
        za_ = 1;
        ulong cz = 0;  // ~z
        len_ = 1;
        num_ = 1;
        for (ulong ldm=1; ldm<=ldn; ++ldm)
        {
            za_ <<= 1;
            cz <<= 1;
            if ( is_pow_of_2(ldm) )
            {
                ++za_;
                len_ <<= 1;
            }
            else
            {
                ++cz;
                num_ <<= 1;
            }
        }

        zi_ = 1UL << ldn;
        b_.first(cz);
    }

    ulong current_max()  const  { return  b_.current() | za_; }
    ulong current_min()  const  { return  b_.current() | zi_; }

    bool next()  { return  ( 0!=b_.next() ); }

    ulong num_cycles()  const  { return num_; }
    ulong cycle_length()  const  { return len_; }
};

The implementation uses the class for subsets of a bitset described in section 1.25 on page 68.
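The mask construction can be compared against a brute-force search for the cycle maxima. The following sketch is an illustration only: both function names are hypothetical, gray_code(x) is taken to be x XOR (x >> 1), and the subsets of v are enumerated with the step s = (s - v) AND v rather than with the FXT bit_subset class.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

using ulong = unsigned long;

inline ulong gray_code(ulong x)  { return x ^ (x >> 1); }

// Cycle maxima of the Gray permutation in the range 2^k ... 2^(k+1)-1,
// found by following x -> gray_code(x) from every start value.
inline std::set<ulong> cycle_maxima_brute(ulong k)
{
    std::set<ulong> mx;
    for (ulong s = 1UL << k; s < (2UL << k); ++s)
    {
        ulong m = s;
        for (ulong x = gray_code(s); x != s; x = gray_code(x))
            m = std::max(m, x);
        mx.insert(m);
    }
    return mx;
}

// The candidates z | s for all subsets s of the mask v.
inline std::set<ulong> cycle_maxima_masks(ulong z, ulong v)
{
    std::set<ulong> mx;
    ulong s = 0;
    do  { mx.insert(z | s);  s = (s - v) & v; }  while ( s != 0 );
    return mx;
}
```

For k = 6, with z = [1110100] = 0x74 and v = [0001011] = 0x0B, both functions return the same eight values 116, 117, 118, 119, 124, 125, 126, 127.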
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2.12.2  In-place  routines 

The in-place versions of the permutation routines are obtained by inlining the generation of the cycle
leaders. The forward version is [FXT: perm/graypermute.h]:

template <typename Type>
void gray_permute(Type *f, ulong n)
{
    ulong z = 1;   // mask for cycle maxima
    ulong v = 0;   // ~z
    ulong cl = 1;  // cycle length
    for (ulong ldm=1, m=2;  m<n;  ++ldm, m<<=1)
    {
        z <<= 1;
        v <<= 1;
        if ( is_pow_of_2(ldm) )
        {
            ++z;
            cl <<= 1;
        }
        else  ++v;

        bit_subset b(v);
        do
        {
            // --- do cycle: ---
            ulong i = z | b.next();  // start of cycle
            Type t = f[i];           // save start value
            ulong g = gray_code(i);  // next in cycle
            for (ulong k=cl-1; k!=0; --k)
            {
                Type tt = f[g];
                f[g] = t;
                t = tt;
                g = gray_code(g);
            }
            f[g] = t;
            // --- end (do cycle) ---
        }
        while ( b.current() );
    }
}

The function is_pow_of_2() is described in section 1.7 on page 17. The inverse routine differs only in
the block that processes the cycles:

template <typename Type>
void inverse_gray_permute(Type *f, ulong n)
{
    [--snip--]
            // --- do cycle: ---
            ulong i = z | b.next();  // start of cycle
            Type t = f[i];           // save start value
            ulong g = gray_code(i);  // next in cycle
            for (ulong k=cl-1; k!=0; --k)
            {
                f[i] = f[g];
                i = g;
                g = gray_code(i);
            }
            f[i] = t;
            // --- end (do cycle) ---
    [--snip--]
}

The Gray permutation is used with certain Walsh transforms, see section 23.7 on page 474.
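The in-place routine can be compared with the out-of-place version in a self-contained sketch. The code below is an illustration under assumptions: the subset enumeration of the FXT bit_subset class is replaced by the step u = (u - v) AND v, gray_code() and is_pow_of_2() are written out as one-liners, and the function names are hypothetical. The array length must be a power of 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ulong = unsigned long;

inline ulong gray_code(ulong x)  { return x ^ (x >> 1); }
inline bool is_pow_of_2(ulong x)  { return x != 0 && (x & (x - 1)) == 0; }

// Out-of-place reference: g[gray_code(k)] = f[k].
template <typename Type>
std::vector<Type> gray_permute_copy(const std::vector<Type> &f)
{
    std::vector<Type> g(f.size());
    for (ulong k = 0; k < f.size(); ++k)  g[gray_code(k)] = f[k];
    return g;
}

// In-place version via cycle leaders; f.size() must be a power of 2.
template <typename Type>
void gray_permute_inplace(std::vector<Type> &f)
{
    const ulong n = f.size();
    ulong z = 1, v = 0, cl = 1;  // maxima mask, its complement, cycle length
    for (ulong ldm = 1, m = 2; m < n; ++ldm, m <<= 1)
    {
        z <<= 1;  v <<= 1;
        if ( is_pow_of_2(ldm) )  { ++z;  cl <<= 1; }
        else  ++v;

        ulong u = 0;  // current subset of v
        do
        {
            u = (u - v) & v;         // next subset (wraps around to 0 last)
            ulong i = z | u;         // start of cycle
            Type t = f[i];           // save start value
            ulong g = gray_code(i);  // next in cycle
            for (ulong k = cl - 1; k != 0; --k)
            {
                Type tt = f[g];  f[g] = t;  t = tt;
                g = gray_code(g);
            }
            f[g] = t;
        }
        while ( u != 0 );
    }
}
```

For an array holding 0, 1, ..., 127 both routines give the same result; for example the value 5 ends up at index gray_code(5) = 7.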


2.12.3  Performance  of  the  routines 

We use the convention that the time for an array reversal is 1.0. The operation is completely cache-friendly
and therefore fast. A simple benchmark gives for 16 MB arrays:

arg 1: 21 == ldn  [Using 2**ldn elements]  default=21
arg 2: 10 == rep  [Number of repetitions]  default=10
Memsize = 16384 kiloByte == 2097152 doubles
  reverse(f,n);                dt= 0.0103524   MB/s= 1546   rel= 1
  revbin_permute(f,n);         dt= 0.0674235   MB/s= 237    rel= 6.51282
  revbin_permute0(f,n);        dt= 0.061507    MB/s= 260    rel= 5.94131
  gray_permute(f,n);           dt= 0.0155019   MB/s= 1032   rel= 1.49742
  inverse_gray_permute(f,n);   dt= 0.0150641   MB/s= 1062   rel= 1.45512

The revbin permutation takes about 6.5 units, due to its memory access pattern that is very problematic
with respect to cache usage. The Gray permutation needs only 1.50 units. The difference gets bigger for
machines with relatively slow memory with respect to the CPU.

The relative speeds are quite different for small arrays. With 16 kB (2048 doubles) we obtain

arg 1: 11 == ldn  [Using 2**ldn elements]  default=21
arg 2: 100000 == rep  [Number of repetitions]  default=512
Memsize = 16 kiloByte == 2048 doubles
  reverse(f,n);                dt=1.88726e-06   MB/s= 8279   rel= 1
  revbin_permute(f,n);         dt=3.22166e-06   MB/s= 4850   rel= 1.70706
  revbin_permute0(f,n);        dt=2.69212e-06   MB/s= 5804   rel= 1.42647
  gray_permute(f,n);           dt=4.75155e-06   MB/s= 3288   rel= 2.51769
  inverse_gray_permute(f,n);   dt=3.69237e-06   MB/s= 4232   rel= 1.95647

Due to the small size, the cache problems are gone.

2.13  The  reversed  Gray  permutation 

The reversed Gray permutation of a length-n array is computed by permuting the elements in the way that the Gray permutation would permute the upper half of an array of length 2n. The array size n must be a power of 2. An implementation is [FXT: perm/grayrevpermute.h]:

template <typename Type>
inline void gray_rev_permute(const Type *f, Type * restrict g, ulong n)
// gray_rev_permute() =^=
// { reverse(); gray_permute(); }
{
    for (ulong k=0, m=n-1; k<n; ++k, --m)  g[gray_code(m)] = f[k];
}
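The two descriptions of the reversed Gray permutation (reversal followed by the Gray permutation, and the action of the Gray permutation on the upper half of an array of length 2n) can be checked against each other with a small sketch. It assumes that gray_code(x) is the binary-reflected Gray code x ^ (x>>1) and that the out-of-place Gray permutation sets g[gray_code(k)] = f[k]; the vector-based helper names are illustrative and not part of FXT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Binary-reflected Gray code (assumed to match FXT's gray_code()).
inline std::size_t gray_code(std::size_t x) { return x ^ (x >> 1); }

// Out-of-place Gray permutation (assumed convention): g[gray_code(k)] = f[k].
template <typename Type>
std::vector<Type> gray_permuted(const std::vector<Type> &f)
{
    const std::size_t n = f.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of 2
    std::vector<Type> g(n);
    for (std::size_t k = 0; k < n; ++k)  g[gray_code(k)] = f[k];
    return g;
}

// Out-of-place reversed Gray permutation, same loop as in the listing above.
template <typename Type>
std::vector<Type> gray_rev_permuted(const std::vector<Type> &f)
{
    const std::size_t n = f.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of 2
    std::vector<Type> g(n);
    for (std::size_t k = 0, m = n - 1; k < n; ++k, --m)  g[gray_code(m)] = f[k];
    return g;
}

// True if the upper half of the Gray-permuted array of length 2n
// equals the reversed Gray permutation of the upper half of the input.
inline bool upper_half_matches(std::size_t n)
{
    std::vector<int> big(2 * n);
    for (std::size_t k = 0; k < 2 * n; ++k)  big[k] = static_cast<int>(k);
    const std::vector<int> upper(big.begin() + static_cast<std::ptrdiff_t>(n), big.end());
    const std::vector<int> gb = gray_permuted(big);
    const std::vector<int> gr = gray_rev_permuted(upper);
    for (std::size_t k = 0; k < n; ++k)
        if (gb[n + k] != gr[k])  return false;
    return true;
}
```

Calling upper_half_matches(64) compares the two constructions for the array size used in the text.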

All cycles have the same length, the cycles with n = 64 elements are

Figure 2.13-A: Permutation matrices of the reversed Gray permutation (left) and its inverse (right).

If 64 is added to the indices, the cycles in the upper half of the array as in gray_permute(f, 128) are reproduced. The in-place version of the permutation routine is

template <typename Type>
void gray_rev_permute(Type *f, ulong n)
// n must be a power of 2, n<=2**(BITS_PER_LONG-2)
{
    f -= n;  // note!

    ulong z = 1;   // mask for cycle maxima
    ulong v = 0;   // ~z
    ulong cl = 1;  // cycle length
    ulong ldm, m;
    for (ldm=1, m=2;  m<=n;  ++ldm, m<<=1)
    {
        z <<= 1;  v <<= 1;
        if ( is_pow_of_2(ldm) )  { ++z;  cl<<=1; }
        else  ++v;
    }

    ulong tv = v, tu = 0;  // cf. bitsubset.h
    do
    {
        tu = (tu-tv) & tv;
        ulong i = z | tu;  // start of cycle

        // do cycle:
        ulong g = gray_code(i);
        Type t = f[i];
        for (ulong k=cl-1; k!=0; --k)
        {
            Type tt = f[g];
            f[g] = t;
            t = tt;
            g = gray_code(g);
        }
        f[g] = t;
        // end (do cycle)
    }
    while ( tu );
}

The  routine  for  the  inverse  permutation  again  differs  only  in  the  way  the  cycles  are  processed: 

template <typename Type>
void inverse_gray_rev_permute(Type *f, ulong n)
{
    [--snip--]
        // do cycle:
        Type t = f[i];           // save start value
        ulong g = gray_code(i);  // next in cycle
        for (ulong k=cl-1; k!=0; --k)
        {
            f[i] = f[g];
            i = g;
            g = gray_code(i);
        }
        f[i] = t;
        // end (do cycle)
    [--snip--]
}


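Let $G$ denote the Gray permutation, $\overline{G}$ the reversed Gray permutation, $r$ the reversal, $h$ the swap of the upper and lower halves, and $X_a$ the XOR permutation (with parameter $a$) from section 2.11 on page 127. We have

\begin{align*}
\overline{G} &= G\,r = h\,G \tag{2.13-1a} \\
\overline{G}^{-1} &= r\,G^{-1} \tag{2.13-1b} \\
\overline{G}^{-1}\,G &= G^{-1}\,\overline{G} = r = X_{n-1} \tag{2.13-1c} \\
\overline{G}\,G^{-1} &= G\,\overline{G}^{-1} = h = X_{n/2} \tag{2.13-1d}
\end{align*}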



Chapter  3 

Sorting  and  searching 


We  give  various  sorting  algorithms  and  some  practical  variants  of  them,  like  sorting  index  arrays  and 
pointer  sorting.  Searching  methods  both  for  sorted  and  for  unsorted  arrays  are  described.  Finally  we 
give  methods  for  the  determination  of  equivalence  classes. 

3.1  Sorting  algorithms 

We give sorting algorithms like selection sort, quicksort, merge sort, counting sort and radix sort. A massive amount of literature exists about the topic, so we will not explore the details. Very readable texts are m and [306], while in-depth information can be found in [214].

3.1.1  Selection  sort 
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Figure  3.1-A:  Sorting  the  string  ‘nowsortme’  with  the  selection  sort  algorithm. 


There are several algorithms for sorting that have complexity O(n^2) where n is the size of the array to be sorted. Here we use selection sort, where the idea is to find the minimum of the array, swap it with the first element, and repeat for all elements but the first. A demonstration of the algorithm is shown in figure 3.1-A, this is the output of [FXT: sort/selection-sort-demo.cc]. The implementation is straightforward [FXT: sort/sort.h]:

template <typename Type>
void selection_sort(Type *f, ulong n)
// Sort f[] (ascending order).
// Algorithm is O(n*n), use for short arrays only.
{
    for (ulong i=0; i<n; ++i)
    {
        Type v = f[i];
        ulong m = i;  // position of minimum
        ulong j = n;
        while ( --j > i )  // search (index of) minimum
        {
            if ( f[j]<v )
            {
                m = j;
                v = f[m];
            }
        }

        swap2(f[i], f[m]);
    }
}

A verification  routine  is  always  handy: 

template <typename Type>
bool is_sorted(const Type *f, ulong n)
// Return whether the sequence f[0], f[1], ..., f[n-1] is ascending.
{
    for (ulong k=1; k<n; ++k)  if ( f[k-1] > f[k] )  return false;
    return true;
}

A test for descending order is

template <typename Type>
bool is_falling(const Type *f, ulong n)
// Return whether the sequence f[0], f[1], ..., f[n-1] is descending.
{
    for (ulong k=1; k<n; ++k)  if ( f[k-1] < f[k] )  return false;
    return true;
}
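As a usage illustration, the following sketch sorts the string 'nowsortme' from figure 3.1-A with a selection sort and verifies the result. It is a simplified variant written against standard containers, not the FXT routine; the names selection_sort_sketch() and is_ascending() are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Ascending-order check, mirroring is_sorted() above.
template <typename Container>
bool is_ascending(const Container &f)
{
    for (std::size_t k = 1; k < f.size(); ++k)
        if (f[k - 1] > f[k])  return false;
    return true;
}

// Selection sort on any indexable container whose elements support operator<.
template <typename Container>
void selection_sort_sketch(Container &f)
{
    const std::size_t n = f.size();
    for (std::size_t i = 0; i + 1 < n; ++i)
    {
        std::size_t m = i;  // position of the minimum of f[i], ..., f[n-1]
        for (std::size_t j = i + 1; j < n; ++j)
            if (f[j] < f[m])  m = j;
        std::swap(f[i], f[m]);  // minimum goes to position i
    }
    assert(is_ascending(f));  // debug check of the postcondition
}
```

With std::string s = "nowsortme", the call selection_sort_sketch(s) leaves the characters of s in ascending order.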

3.1.2  Quicksort 

The quicksort algorithm is given in m, it has complexity O(n log(n)) (in the average case). It does not make the simpler schemes obsolete, because for small arrays the simpler algorithms are usually faster, due to their minimal bookkeeping overhead.

The main activity of quicksort is partitioning the array. The corresponding routine reorders the array and returns a pivot index p so that $\max(f_0, \ldots, f_p) \le \min(f_{p+1}, \ldots, f_{n-1})$ [FXT: sort/sort.h]:

template <typename Type>
ulong partition(Type *f, ulong n)
{
    // Avoid worst case with already sorted input:
    const Type v = median3(f[0], f[n/2], f[n-1]);

    ulong i = 0UL - 1;
    ulong j = n;
    while ( 1 )
    {
        do { ++i; } while ( f[i]<v );
        do { --j; } while ( f[j]>v );

        if ( i<j )  swap2(f[i], f[j]);
        else        return j;
    }
}

The function median3() is defined in [FXT: sort/minmaxmed23.h]:

template <typename Type>
static inline Type median3(const Type &x, const Type &y, const Type &z)
// Return median of the input values
{ return x<y ? (y<z ? y : (x<z ? z : x)) : (z<y ? y : (z<x ? z : x)); }

The function does 2 or 3 comparisons, depending on the input. One could simply use the element f[0] as pivot. However, the algorithm will need O(n^2) operations when the array is already sorted.

Quicksort calls partition on the whole array, then on the two parts left and right from the partition index, then for the four, eight, etc. parts, until the parts are of length one. Note that the sub-arrays are usually of different lengths.

template <typename Type>
void quick_sort(Type *f, ulong n)
{
    if ( n<=1 )  return;

    ulong p = partition(f, n);
    ulong ln = p + 1;
    ulong rn = n - ln;

    quick_sort(f, ln);     // f[0] ... f[ln-1]  left
    quick_sort(f+ln, rn);  // f[ln] ... f[n-1]  right
}

The actual implementation uses two optimizations: Firstly, if the number of elements to be sorted is less than a certain threshold, selection sort is used. Secondly, the recursive calls are made for the smaller of the two sub-arrays, thereby the stack size is bounded by $\lceil \log_2(n) \rceil$.

template <typename Type>
void quick_sort(Type *f, ulong n)
{
 start:
    if ( n<8 )  // parameter: threshold for nonrecursive algorithm
    {
        selection_sort(f, n);
        return;
    }

    ulong p = partition(f, n);
    ulong ln = p + 1;
    ulong rn = n - ln;

    if ( ln>rn )  // recursion for shorter sub-array
    {
        quick_sort(f+ln, rn);  // f[ln] ... f[n-1]  right
        n = ln;
    }
    else
    {
        quick_sort(f, ln);  // f[0] ... f[ln-1]  left
        n = rn;
        f += ln;
    }

    goto start;
}
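The following self-contained sketch combines the pieces described above: median-of-3 pivot selection, the two-pointer partition, a threshold for small arrays, and recursion only into the shorter sub-array. It is an illustrative rewrite using the standard library, not the FXT code; the names partition_sketch() and quick_sort_sketch() and the use of a loop instead of goto are choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template <typename Type>
const Type &median3(const Type &x, const Type &y, const Type &z)
{ return x<y ? (y<z ? y : (x<z ? z : x)) : (z<y ? y : (z<x ? z : x)); }

// Reorder f[0..n-1] and return j such that every element of f[0..j]
// is <= every element of f[j+1..n-1]; for n>=3 both parts are non-empty.
template <typename Type>
std::size_t partition_sketch(Type *f, std::size_t n)
{
    assert(n >= 3);  // shorter arrays are handled by the caller
    const Type v = median3(f[0], f[n/2], f[n-1]);  // pivot value
    std::size_t i = 0, j = n - 1;
    while (true)
    {
        while (f[i] < v)  ++i;
        while (f[j] > v)  --j;
        if (i >= j)  return j;
        std::swap(f[i], f[j]);
        ++i;  --j;
    }
}

template <typename Type>
void quick_sort_sketch(Type *f, std::size_t n)
{
    while (n >= 8)  // threshold for the nonrecursive algorithm
    {
        const std::size_t ln = partition_sketch(f, n) + 1;
        const std::size_t rn = n - ln;
        // recurse into the shorter part, iterate on the longer one
        if (ln > rn)  { quick_sort_sketch(f + ln, rn);  n = ln; }
        else          { quick_sort_sketch(f, ln);  f += ln;  n = rn; }
    }
    for (std::size_t i = 0; i + 1 < n; ++i)  // selection sort for the short rest
    {
        std::size_t m = i;
        for (std::size_t j = i + 1; j < n; ++j)  if (f[j] < f[m])  m = j;
        std::swap(f[i], f[m]);
    }
}

// Convenience wrapper for std::vector.
template <typename Type>
void quick_sort_sketch(std::vector<Type> &v)
{
    quick_sort_sketch(v.data(), v.size());
    assert(std::is_sorted(v.begin(), v.end()));  // debug check
}
```

Because the recursion always descends into the part with at most half of the elements, the recursion depth stays logarithmic in n even for inputs where the running time degrades.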


The quicksort algorithm will be quadratic with certain inputs. A clever method to construct such inputs is described in [247]. The heapsort algorithm is in-place and O(n log(n)) (also in the worst case). It is described in section 3.1.5. Inputs that lead to quadratic time for the quicksort algorithm with median-of-3 partitioning are described in [257]. The paper suggests using quicksort, but detecting problematic behavior during runtime and switching to heapsort if needed. The corresponding algorithm is called introsort (for introspective sorting).


3.1.3  Counting  sort  and  radix  sort 

We want to sort an n-element array F of (unsigned) 8-bit values. A sorting algorithm which involves only 2 passes through the data proceeds as follows:

1. Allocate an array C of 256 integers and set all its elements to zero.

2. Count: for k = 0, 1, ..., n - 1 increment C[F[k]].
   Now C[x] contains the number of bytes in F with the value x.

3. Set r = 0. For j = 0, 1, ..., 255
   set k = C[j], then set the elements F[r], F[r + 1], ..., F[r + k - 1] to j, and add k to r.

For large values of n this method is significantly faster than any other sorting algorithm. Note that no comparisons are made between the elements of F. Instead they are counted; the algorithm is the counting sort algorithm.
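The three steps can be written down almost literally. The following is a minimal sketch using standard containers; the function name counting_sort_bytes() is chosen for this example and does not appear in FXT.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort an array of 8-bit values in place by counting.
inline void counting_sort_bytes(std::vector<std::uint8_t> &F)
{
    std::array<std::size_t, 256> C{};   // step 1: all counters zero
    for (std::uint8_t x : F)  ++C[x];   // step 2: C[x] = number of bytes with value x

    std::size_t r = 0;                  // step 3: write k copies of each value j
    for (std::size_t j = 0; j < 256; ++j)
    {
        const std::size_t k = C[j];
        for (std::size_t i = 0; i < k; ++i)  F[r + i] = static_cast<std::uint8_t>(j);
        r += k;
    }
    assert(r == F.size());  // every element has been written exactly once
}
```

Note that the elements are regenerated from the counts rather than moved, which is only possible because an 8-bit value carries no information besides its key.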

It might seem that the idea applies only to very special cases, but with a little care it can be used in more general situations. We modify the method so that it sorts (unsigned) integers, whose full range of values would make plain counting impractical, with respect to a subrange of the bits in each word. We need an array G that has as many elements as F:

1. Choose any consecutive run of b bits, these will be represented by a bit mask m. Allocate an array C of 2^b integers and set all its elements to zero.

2. Let M be a function that maps the 2^b values of interest (the bits masked out by m) to the range 0, 1, ..., 2^b - 1.

3. Count: for k = 0, 1, ..., n - 1 increment C[M(F[k])].
   Now C[x] contains how many values of M(F[.]) equal x.

4. Cumulate: for j = 1, 2, ..., 2^b - 1 (second to last) add C[j - 1] to C[j].
   Now C[x] contains the number of values M(F[.]) less than or equal to x.

5. Copy: for k = n - 1, ..., 2, 1, 0 (last to first), do as follows:
   set x := M(F[k]), decrement C[x], set i := C[x], and set G[i] := F[k].

A crucial property of the algorithm is that it is stable: if two (or more) elements compare equal (with respect to a certain bit-mask m), then the relative order between these elements is preserved.
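The five steps, and the stability they provide, can be tried out with the following sketch. It is an illustration under the assumption that the mask is passed in unshifted form (for example 3 for two bits) together with the position b0 of its lowest bit; the names M() and counting_sort_wrt_bits() are chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// M(x): the bits of x selected by the mask m after shifting down by b0.
inline unsigned long M(unsigned long x, unsigned b0, unsigned long m)
{ return (x >> b0) & m; }

// Return a copy of F that is sorted with respect to the bits b0, b0+1, ...
// selected by the (unshifted) mask m, which must be of the form 2^b - 1.
inline std::vector<unsigned long>
counting_sort_wrt_bits(const std::vector<unsigned long> &F, unsigned b0, unsigned long m)
{
    assert((m & (m + 1)) == 0);                    // single run of bits from bit zero
    std::vector<std::size_t> C(m + 1);             // step 1: 2^b zeroed counters
    for (unsigned long x : F)  ++C[M(x, b0, m)];   // step 3: count
    for (std::size_t j = 1; j < C.size(); ++j)     // step 4: cumulate
        C[j] += C[j - 1];

    std::vector<unsigned long> G(F.size());
    for (std::size_t k = F.size(); k-- != 0; )     // step 5: copy, last to first
    {
        const unsigned long x = M(F[k], b0, m);
        G[--C[x]] = F[k];
    }
    return G;
}
```

Sorting the 4-bit values 7, 11, 4, 3, 8 with respect to the two lowest bits keeps 7, 11, 3 (all ending in two set bits) in their input order; a second pass with respect to the two upper bits then yields the fully sorted sequence, which is the idea behind radix sort.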


      Input             Counting sort wrt. two lowest bits
                        m = ......11
 0:  11111.11<          0:  1. . .
 1:  1. . .             1:  ..1111..
 2:  ...1.1.1           2:  .111
 3:  ..1...1.           3:  ...1.1.1
 4:  ..1.1111<          4:  .1..1..1
 5:  ..1111..           5:  ..1...1.
 6:  .1..1..1           6:  .1.1.11.
 7:  .1.1.11.           7:  11111.11<
 8:  .11...11<          8:  ..1.1111<
 9:  .111               9:  .11...11<

The relative order of the three words ending with two set bits (marked with '<') is preserved.

A routine that verifies whether an array is sorted with respect to a bit range specified by the variables b0 and m is [FXT: sort/radixsort.cc]:

bool
is_counting_sorted(const ulong *f, ulong n, ulong b0, ulong m)
// Whether f[] is sorted wrt. bits b0,...,b0+z-1
// where z is the number of bits set in m.
// m must contain a single run of bits starting at bit zero.
{
    m <<= b0;
    for (ulong k=1; k<n; ++k)
    {
        ulong xm = (f[k-1] & m ) >> b0;
        ulong xp = (f[k] & m ) >> b0;
        if ( xm>xp )  return false;
    }
    return true;
}

The function M is the combination of a mask-out and a shift operation. A routine that sorts according to b0 and m is:

void
counting_sort_core(const ulong * restrict f, ulong n, ulong * restrict g, ulong b0, ulong m)
// Write to g[] the array f[] sorted wrt. bits b0,...,b0+z-1
// where z is the number of bits set in m.
// m must contain a single run of bits starting at bit zero.
{
    ulong nb = m + 1;
    m <<= b0;
    ALLOCA(ulong, cv, nb);
    for (ulong k=0; k<nb; ++k)  cv[k] = 0;

    // count:
    for (ulong k=0; k<n; ++k)
    {
        ulong x = (f[k] & m ) >> b0;
        ++cv[x];
    }

    // cumulative sums:
    for (ulong k=1; k<nb; ++k)  cv[k] += cv[k-1];

    // reorder:
    ulong k = n;
    while ( k-- )  // backwards ==> stable sort
    {
        ulong fk = f[k];
        ulong x = (fk & m) >> b0;
        --cv[x];
        ulong i = cv[x];
        g[i] = fk;
    }
}

  Input      Stage 1      Stage 2      Stage 3
             m = ....11   m = ..11..   m = 11....
                     vv         vv       vv
  111.11     ..1...       11....       ..1...
  ..1...     1111..       1...1.       ..1..1
  .1.1.1     11....       1...11       .1.1.1
  1...1.     .1.1.1       .1.1.1       .1.11.
  1.1111     ..1..1       .1.11.       1...1.
  1111..     1...1.       ..1...       1...11
  ..1..1     .1.11.       ..1..1       1.1111
  .1.11.     111.11       111.11       11....
  1...11     1.1111       1111..       111.11
  11....     1...11       1.1111       1111..

Figure 3.1-B: Radix sort of 10 six-bit values when using two-bit masks.


Now we can apply counting sort to a set of bit masks that cover the whole range. Figure 3.1-B shows an example with 10 six-bit values and 3 two-bit masks, starting from the least significant bits. This is the output of the program [FXT: sort/radixsort-demo.cc].

The following routine uses 8-bit masks to sort unsigned integers [FXT: sort/radixsort.cc]:

void
radix_sort(ulong *f, ulong n)
{
    ulong nb = 8;  // Number of bits sorted with each step
    ulong tnb = BITS_PER_LONG;  // Total number of bits

    ulong *fi = f;
    ulong *g = new ulong[n];

    ulong m = (1UL<<nb) - 1;
    for (ulong k=1, b0=0;  b0<tnb;  ++k, b0+=nb)
    {
        counting_sort_core(f, n, g, b0, m);
        swap2(f, g);
    }

    if ( f!=fi )  // result is actually in g[]
    {
        swap2(f, g);
        for (ulong k=0; k<n; ++k)  f[k] = g[k];
    }

    delete [] g;
}

There  is  room  for  optimization.  Combining  copying  with  counting  for  the  next  pass  (where  possible) 
would  reduce  the  number  of  passes  almost  by  a factor  of  2. 

A version  of  radix  sort  that  starts  from  the  most  significant  bits  is  given  in  |HM|. 


3.1.4  Merge  sort 

The merge sort algorithm is a method for sorting with complexity O(n log(n)). We need a routine that copies two sorted arrays A and B into an array T such that T is in sorted order. The following implementation requires that A and B are adjacent in memory [FXT: sort/merge-sort.h]:

template <typename Type>
void merge(Type * const restrict f, ulong na, ulong nb, Type * const restrict t)
// Merge the (sorted) arrays
//   A[] := f[0], f[1], ..., f[na-1]   and   B[] := f[na], f[na+1], ..., f[na+nb-1]
// into  t[] := t[0], t[1], ..., t[na+nb-1]  such that t[] is sorted.
// Must have: na>0 and nb>0
{
    const Type * const A = f;
    const Type * const B = f + na;
    ulong nt = na + nb;
    Type ta = A[--na],  tb = B[--nb];

    while ( true )
    {
        if ( ta > tb )  // copy ta
        {
            t[--nt] = ta;
            if ( na==0 )  // A[] empty?
            {
                for (ulong j=0; j<=nb; ++j)  t[j] = B[j];  // copy rest of B[]
                return;
            }

            ta = A[--na];  // read next element of A[]
        }
        else  // copy tb
        {
            t[--nt] = tb;
            if ( nb==0 )  // B[] empty?
            {
                for (ulong j=0; j<=na; ++j)  t[j] = A[j];  // copy rest of A[]
                return;
            }

            tb = B[--nb];  // read next element of B[]
        }
    }
}

Figure 3.1-C: Sorting with the merge sort algorithm.

Two  branches  are  involved,  the  unavoidable  branch  with  the  comparison  of  the  elements,  and  the  test 
for  empty  array  where  an  element  has  been  removed. 

We  could  sort  by  merging  adjacent  blocks  of  growing  size  as  follows: 


[hgfedcba]   // input
[ghefcdab]   // merge pairs
[efghabcd]   // merge adjacent runs of two
[abcdefgh]   // merge adjacent runs of four
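The scheme of merging adjacent blocks of growing size can be sketched as follows. This is an illustration built on std::merge from the standard library instead of the merge() routine above; the name merge_sort_bottom_up() is chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bottom-up merge sort: merge adjacent sorted runs of length 1, 2, 4, ...
template <typename Type>
void merge_sort_bottom_up(std::vector<Type> &f)
{
    const std::size_t n = f.size();
    std::vector<Type> t(n);  // scratch space, the method is not in-place
    for (std::size_t w = 1; w < n; w *= 2)  // w: length of the sorted runs
    {
        for (std::size_t lo = 0; lo < n; lo += 2 * w)
        {
            const std::size_t mid = std::min(lo + w, n);      // end of left run
            const std::size_t hi  = std::min(lo + 2 * w, n);  // end of right run
            const auto b = f.begin();
            std::merge(b + static_cast<std::ptrdiff_t>(lo),  b + static_cast<std::ptrdiff_t>(mid),
                       b + static_cast<std::ptrdiff_t>(mid), b + static_cast<std::ptrdiff_t>(hi),
                       t.begin() + static_cast<std::ptrdiff_t>(lo));
        }
        f.swap(t);  // the merged runs become the input of the next pass
    }
    assert(std::is_sorted(f.begin(), f.end()));  // debug check
}
```

Each pass reads and writes the whole array once, so for large arrays every pass streams through memory; this is the access pattern that the depth-first recursion tries to improve on.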


For a more localized memory access, we use a depth first recursion (compare with the binsplit recursion in section 34.1.1.1 on page 651):

template <typename Type>
void merge_sort_rec(Type *f, ulong n, Type *t)
{
    if ( n<8 )
    {
        selection_sort(f, n);
        return;
    }

    const ulong na = n>>1;
    const ulong nb = n - na;

    // PRINT f[0], f[1], ..., f[na-1]
    merge_sort_rec(f, na, t);
    // PRINT f[na], f[na+1], ..., f[na+nb-1]
    merge_sort_rec(f+na, nb, t);

    merge(f, na, nb, t);
    for (ulong j=0; j<n; ++j)  f[j] = t[j];  // copy back
    // PRINT f[0], f[1], ..., f[na+nb-1]
}


The comments PRINT indicate the print statements in the program [FXT: sort/merge-sort-demo.cc] that was used to generate figure 3.1-C. The method is (obviously) not in-place. The routine called by the user is

template <typename Type>
void merge_sort(Type *f, ulong n, Type *tmp=0)
{
    Type *t = tmp;
    if ( tmp==0 )  t = new Type[n];
    merge_sort_rec(f, n, t);
    if ( tmp==0 )  delete [] t;
}

Optimized  algorithm 
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Figure  3.1-D:  Sorting  with  the  4-way  merge  sort  algorithm. 


The copying from T to F in the recursive routine can be avoided by a 4-way splitting scheme. We sort the left two quarters and merge them into T, then we sort the right two quarters and merge them into T + n_a. Then we merge T and T + n_a into F. Figure 3.1-D shows an example where only one recursive step is involved. It was generated with the program [FXT: sort/merge-sort4-demo.cc]. The recursive routine is [FXT: sort/merge-sort.h]:

template <typename Type>
void merge_sort_rec4(Type *f, ulong n, Type *t)
{
    if ( n<8 )  // threshold must be at least 8
    {
        selection_sort(f, n);
        return;
    }

    // left and right half:
    const ulong na = n>>1;
    const ulong nb = n - na;

    // left quarters:
    const ulong na1 = na>>1;
    const ulong na2 = na - na1;
    merge_sort_rec4(f, na1, t);
    merge_sort_rec4(f+na1, na2, t);

    // right quarters:
    const ulong nb1 = nb>>1;
    const ulong nb2 = nb - nb1;
    merge_sort_rec4(f+na, nb1, t);
    merge_sort_rec4(f+na+nb1, nb2, t);

    // merge quarters (F-->T):
    merge(f, na1, na2, t);
    merge(f+na, nb1, nb2, t+na);

    // merge halves (T-->F):
    merge(t, na, nb, f);
}

The routine called by the user is merge_sort4().


3.1.5  Heapsort 


The heapsort algorithm has complexity O(n log(n)). It uses the heap data structure introduced in section 4.5.2 on page 160. A heap can be sorted by swapping the first (and biggest) element with the last and restoring the heap property for the array of size n - 1. Repeat until there is nothing more to sort [FXT: sort/heapsort.h]:

template <typename Type>
void heap_sort(Type *x, ulong n)
{
    build_heap(x, n);
    Type *p = x - 1;
    for (ulong k=n; k>1; --k)
    {
        swap2(p[1], p[k]);  // move largest to end of array
        --n;                // remaining array has one element less
        heapify(p, n, 1);   // restore heap-property
    }
}


Sorting  into  descending  order  is  not  any  harder: 

template <typename Type>
void heap_sort_descending(Type *x, ulong n)
// Sort x[] into descending order.
{
    build_heap(x, n);
    Type *p = x - 1;
    for (ulong k=n; k>1; --k)
    {
        ++p;  --n;         // remaining array has one element less
        heapify(p, n, 1);  // restore heap-property
    }
}

A program that demonstrates the algorithm is [FXT: sort/heapsort-demo.cc].
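Since build_heap() and heapify() are defined elsewhere (section 4.5.2), the following self-contained sketch may help to see the whole algorithm in one place. It uses 0-based indexing and a single helper sift_down() that stands in for heapify(); both names and the indexing convention are assumptions of this example and differ from the FXT routines, which use the 1-based pointer p = x - 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Restore the max-heap property for the subtree rooted at index k
// of the heap x[0], ..., x[n-1] (children of k are 2k+1 and 2k+2).
template <typename Type>
void sift_down(Type *x, std::size_t n, std::size_t k)
{
    while (true)
    {
        std::size_t c = 2 * k + 1;               // left child
        if (c >= n)  return;                     // k is a leaf
        if (c + 1 < n && x[c] < x[c + 1])  ++c;  // pick the larger child
        if (!(x[k] < x[c]))  return;             // heap property holds
        std::swap(x[k], x[c]);
        k = c;
    }
}

template <typename Type>
void heap_sort_sketch(std::vector<Type> &v)
{
    Type *x = v.data();
    std::size_t n = v.size();
    for (std::size_t k = n / 2; k-- != 0; )  sift_down(x, n, k);  // build the heap
    while (n > 1)
    {
        std::swap(x[0], x[n - 1]);  // move largest to end of array
        --n;                        // remaining heap has one element less
        sift_down(x, n, 0);         // restore heap property
    }
    assert(std::is_sorted(v.begin(), v.end()));  // debug check
}
```

The sort is in-place and needs no recursion; each of the n - 1 extraction steps costs at most about log2(n) swaps.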


3.2  Binary  search 


Searching for an element in a sorted array can be done in O(log(n)) operations. The binary search algorithm uses repeated subdivision of the data [FXT: sort/bsearch.h]:

template <typename Type>
ulong bsearch(const Type *f, ulong n, const Type v)
// Return index of first element in f[] that equals v
// Return n if there is no such element.
// f[] must be sorted in ascending order.
// Must have n!=0
{
    ulong nlo=0, nhi=n-1;
    while ( nlo != nhi )
    {
        ulong t = (nhi+nlo)/2;

        if ( f[t] < v )  nlo = t + 1;
        else             nhi = t;
    }

    if ( f[nhi]==v )  return nhi;
    else              return n;
}

Only simple modifications are needed to search, for example, for the first element greater than or equal to a given value:

template <typename Type>
ulong bsearch_geq(const Type *f, ulong n, const Type v)
{
    ulong nlo=0, nhi=n-1;
    while ( nlo != nhi )
    {
        ulong t = (nhi+nlo)/2;

        if ( f[t] < v )  nlo = t + 1;
        else             nhi = t;
    }

    if ( f[nhi]>=v )  return nhi;
    else              return n;
}

For very large arrays the algorithm can be improved by selecting the new index t different from the midpoint (nhi+nlo)/2, depending on the value sought and the distribution of the values in the array. As a simple example consider an array of floating-point numbers that are equally distributed in the interval $[\min(u), \max(u)]$. If the sought value equals $v$, one starts with the relation

\begin{equation*}
\frac{n - \min(n)}{\max(n) - \min(n)} = \frac{v - \min(u)}{\max(u) - \min(u)} \tag{3.2-1}
\end{equation*}

where $n$ denotes an index and $\min(n)$, $\max(n)$ denote the minimal and maximal index of the current interval. Solving for $n$ gives the linear interpolation formula

\begin{equation*}
n = \min(n) + \frac{\max(n) - \min(n)}{\max(u) - \min(u)} \, \left( v - \min(u) \right) \tag{3.2-2}
\end{equation*}

The corresponding interpolation binary search algorithm would select the new subdivision index t according to the given relation. One could even use quadratic interpolation schemes for the selection of t. For the majority of practical applications the midpoint version of the binary search will be good enough.
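A minimal sketch of such an interpolation search for an ascending array of doubles is given below. It is not taken from FXT: the name interpolation_search(), the use of the values at the interval ends as min(u) and max(u), and the convention of returning the index of some matching element (not necessarily the first) are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the index of some element equal to v in the ascending array f,
// or f.size() if there is no such element.
inline std::size_t interpolation_search(const std::vector<double> &f, double v)
{
    const std::size_t n = f.size();
    if (n == 0)  return n;
    std::size_t lo = 0, hi = n - 1;       // current interval of indices
    while (f[lo] < v && v < f[hi])        // implies lo < hi and f[lo] < f[hi]
    {
        // linear interpolation as in relation (3.2-2)
        const double frac = (v - f[lo]) / (f[hi] - f[lo]);
        std::size_t t = lo + static_cast<std::size_t>(frac * static_cast<double>(hi - lo));
        if (t > hi)  t = hi;              // guard against rounding
        if (f[t] < v)       lo = t + 1;   // here t < hi, so lo <= hi
        else if (v < f[t])  hi = t - 1;   // here t > lo, so hi >= lo
        else                return t;
    }
    if (f[lo] == v)  return lo;
    if (f[hi] == v)  return hi;
    return n;
}
```

For evenly distributed values the interval shrinks much faster than with bisection, while for strongly skewed data a single step may exclude only one element, which is one reason why the midpoint version remains the safe default.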

Approximate matches are found by the following routine [FXT: sort/bsearchapprox.h]:

template <typename Type>
ulong bsearch_approx(const Type *f, ulong n, const Type v, Type da)
// Return index of first element x in f[] for which |(x-v)| <= da
// Return n if there is no such element.
// f[] must be sorted in ascending order.
// da must be positive.
//
// Makes sense only with inexact types (float or double).
// Must have n!=0
{
    ulong k = bsearch_geq(f, n, v-da);
    if ( k<n )  k = bsearch_leq(f+k, n-k, v+da);
    return k;
}

3.3  Variants  of  sorting  methods 

Some  practical  variants  of  sorting  algorithms  are  described,  like  sorting  index  arrays,  pointer  sorting,  and 
sorting  with  a supplied  comparison  function. 

3.3.1  Index  sorting 

With normal sorting we order the elements of an array f so that f[k] <= f[k+1]. The index-sort routines order the indices in an array x so that the sequence f[x[k]] is in ascending order, we have f[x[k]] <= f[x[k+1]]. The implementation for the selection sort algorithm is [FXT: sort/sortidx.h]:

template <typename Type>
void idx_selection_sort(const Type *f, ulong n, ulong *x)
// Sort x[] so that the sequence f[x[0]], f[x[1]], ... f[x[n-1]] is ascending.
// Algorithm is O(n*n), use for short arrays only.
{
    for (ulong i=0; i<n; ++i)
    {
        Type v = f[x[i]];
        ulong m = i;  // position-ptr of minimum
        ulong j = n;
        while ( --j > i )  // search (index of) minimum
        {
            if ( f[x[j]]<v )
            {
                m = j;
                v = f[x[m]];
            }
        }

        swap2(x[i], x[m]);
    }
}


The verification code is

template <typename Type>
bool is_idx_sorted(const Type *f, ulong n, const ulong *x)
// Return whether the sequence f[x[0]], f[x[1]], ... f[x[n-1]] is in ascending order.
{
    for (ulong k=1; k<n; ++k)  if ( f[x[k-1]] > f[x[k]] )  return false;
    return true;
}


The transformation of the partition() routine is straightforward:

template <typename Type>
ulong idx_partition(const Type *f, ulong n, ulong *x)
// rearrange index array, so that for some index p
// max(f[x[0]] ... f[x[p]]) <= min(f[x[p+1]] ... f[x[n-1]])
{
    // Avoid worst case with already sorted input:
    const Type v = median3(f[x[0]], f[x[n/2]], f[x[n-1]]);

    ulong i = 0UL - 1;
    ulong j = n;
    while ( 1 )
    {
        do ++i;
        while ( f[x[i]]<v );

        do --j;
        while ( f[x[j]]>v );

        if ( i<j )  swap2(x[i], x[j]);
        else        return j;
    }
}

The index-quicksort itself deserves a minute of contemplation comparing it to the plain version:

template <typename Type>
void idx_quick_sort(const Type *f, ulong n, ulong *x)
// Sort x[] so that the sequence f[x[0]], f[x[1]], ... f[x[n-1]] is ascending.
{
 start:
    if ( n<8 )  // parameter: threshold for nonrecursive algorithm
    {
        idx_selection_sort(f, n, x);
        return;
    }

    ulong p = idx_partition(f, n, x);
    ulong ln = p + 1;
    ulong rn = n - ln;

    if ( ln>rn )  // recursion for shorter sub-array
    {
        idx_quick_sort(f, rn, x+ln);  // f[x[ln]] ... f[x[n-1]]  right
        n = ln;
    }
    else
    {
        idx_quick_sort(f, ln, x);  // f[x[0]] ... f[x[ln-1]]  left
        n = rn;
        x += ln;
    }

    goto start;
}


Note that the index-sort routines work perfectly for non-contiguous data. The index-analogues of the binary search algorithms are again straightforward, they are given in [FXT: sort/bsearchidx.h].

The sorting routines do not change the array f, the actual data is not modified. To bring f into sorted order, apply the inverse permutation of x to f (see section 2.4 on page 109):

apply_inverse_permutation(x, f, n);

To copy f in sorted order into g, use:

apply_inverse_permutation(x, f, n, g);
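The interplay of index sorting and the final gather step can be tried out with the standard library alone. The following sketch is not the FXT implementation: it uses std::stable_sort with a comparison on indices, and the helper gather() is a hypothetical stand-in that sets g[k] = f[x[k]], which is how the effect of the out-of-place call above is understood here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Return x[] such that f[x[0]], f[x[1]], ..., f[x[n-1]] is ascending.
// The array f itself is not modified.
template <typename Type>
std::vector<std::size_t> sorted_indices(const std::vector<Type> &f)
{
    std::vector<std::size_t> x(f.size());
    std::iota(x.begin(), x.end(), std::size_t(0));  // x[k] = k
    std::stable_sort(x.begin(), x.end(),
                     [&f](std::size_t a, std::size_t b) { return f[a] < f[b]; });
    return x;
}

// Copy f in the order given by x: g[k] = f[x[k]].
template <typename Type>
std::vector<Type> gather(const std::vector<Type> &f, const std::vector<std::size_t> &x)
{
    assert(f.size() == x.size());
    std::vector<Type> g(f.size());
    for (std::size_t k = 0; k < x.size(); ++k)  g[k] = f[x[k]];
    return g;
}
```

The same index array can be used to reorder a second array of equal length, for example gather(data, sorted_indices(key)), which is the idea behind sorting by keys.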


[Figure: columns f[] and key[] of the input, and columns f[] and key[] after sort_by_key(f, n, key, 1)]

Figure 3.3-A: Sorting an array according to an array of keys.

The array x can be used for sorting by keys, see figure 3.3-A. The routine is [FXT: sort/sortbykey.h]:


template <typename Type1, typename Type2>
void sort_by_key(Type1 *f, ulong n, Type2 *key, bool skq=true)
// Sort f[] according to key[] in ascending order:
//   f[k] precedes f[j] if key[k]<key[j].
// If skq is true then key[] is also sorted.
{
    ALLOCA(ulong, x, n);
    for (ulong k=0; k<n; ++k)  x[k] = k;
    idx_quick_sort(key, n, x);
    apply_inverse_permutation(x, f, n);
    if ( skq )  apply_inverse_permutation(x, key, n);
}


3.3.2  Pointer  sorting 


Pointer sorting is similar to index sorting. The array of indices is replaced by an array of pointers [FXT: sort/sortptr.h]:

template <typename Type>
void ptr_selection_sort(/*const Type *f,*/ ulong n, const Type **x)
// Sort x[] so that the sequence *x[0], *x[1], ..., *x[n-1] is ascending.
{
    for (ulong i=0; i<n; ++i)
    {
        Type v = *x[i];
        ulong m = i;  // position-ptr of minimum
        ulong j = n;
        while ( --j > i )  // search (index of) minimum
        {
            if ( *x[j]<v )
            {
                m = j;
                v = *x[m];
            }
        }

        swap2(x[i], x[m]);
    }
}

The first argument (const Type *f) is not necessary with pointer sorting, it is indicated as a comment to make the argument structure uniform. The verification routine is

template <typename Type>
bool is_ptr_sorted(/*const Type *f,*/ ulong n, Type const*const*x)
// Return whether the sequence *x[0], *x[1], ..., *x[n-1] is ascending.
{
    for (ulong k=1; k<n; ++k)  if ( *x[k-1] > *x[k] )  return false;
    return true;
}

The pointer versions of the search routines are given in [FXT: sort/bsearchptr.h].


3.3.3  Sorting  by  a supplied  comparison  function 


The routines in [FXT: sort/sortfunc.h] are similar to the C-quicksort qsort that is part of the standard library. A comparison function cmp has to be supplied by the caller. This allows, for example, sorting compound data types with respect to some key contained within them. Citing the manual page for qsort:

The  comparison  function  must  return  an  integer  less  than,  equal  to,  or  greater  than 
zero  if  the  first  argument  is  considered  to  be  respectively  less  than,  equal  to,  or 
greater  than  the  second.  If  two  members  compare  as  equal,  their  order  in  the 
sorted  array  is  undefined. 

As a prototypical example we give the selection sort routine:

    template <typename Type>
    void selection_sort(Type *f, ulong n, int (*cmp)(const Type &, const Type &))
    // Sort f[] (ascending order) with respect to comparison function cmp().
    {
        for (ulong i=0; i<n; ++i)
        {
            Type v = f[i];
            ulong m = i;  // position of minimum
            ulong j = n;
            while ( --j > i )  // search (index of) minimum
            {
                if ( cmp(f[j], v) < 0 )
                {
                    m = j;
                    v = f[m];
                }
            }
            swap2(f[i], f[m]);
        }
    }


The other routines are rather straightforward translations of the (plain) sort analogues. Replace the comparison operations involving elements of the array as follows:

    (a <  b)      cmp(a,b) <  0
    (a >  b)      cmp(a,b) >  0
    (a == b)      cmp(a,b) == 0
    (a <= b)      cmp(a,b) <= 0
    (a >= b)      cmp(a,b) >= 0


The verification routine is

    template <typename Type>
    bool is_sorted(const Type *f, ulong n, int (*cmp)(const Type &, const Type &))
    // Return whether the sequence f[0], f[1], ..., f[n-1]
    // is sorted in ascending order with respect to comparison function cmp().
    {
        for (ulong k=1; k<n; ++k)  if ( cmp(f[k-1], f[k]) > 0 )  return false;
        return true;
    }

The numerous calls to cmp() do have a negative impact on the performance. With C++ you can provide a comparison 'function' for a class by overloading the comparison operators <, >, <=, >=, and == and use the plain sort version. That is, the comparisons are inlined and the performance should be fine.
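A minimal sketch of the operator-overloading approach follows. The type Record and the routine plain_selection_sort() are hypothetical names chosen for illustration. The sort uses only operator <, which the compiler is free to inline.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical compound type; the ordering looks at the key only.
struct Record
{
    int key;
    char tag;
};

inline bool operator < (const Record &a, const Record &b)
{ return a.key < b.key; }

// Plain selection sort that uses only operator <, no function pointer.
template <typename Type>
void plain_selection_sort(Type *f, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        std::size_t m = i;  // position of minimum
        for (std::size_t j = i + 1; j < n; ++j)
            if ( f[j] < f[m] )  m = j;
        std::swap(f[i], f[m]);
    }
}
```

Whether the inlined version is actually faster than the function-pointer version depends on the compiler and the data type, so it is worth measuring.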

3.3.3.1  Sorting complex numbers

You want to sort complex numbers? Fine with me, but don't tell your local mathematician. To see the mathematical problem, we ask whether $i$ is less than or greater than zero. Assuming $i > 0$ it follows that $i \cdot i > 0$ (we multiplied with a positive value), which is $-1 > 0$, and that is false. So, is $i < 0$? Then $i \cdot i > 0$ (multiplication with a negative value, as assumed), thereby $-1 > 0$. Oops! The lesson is that there is no way to impose an order on the complex numbers that would justify the usage of the symbols '<' and '>' consistent with the rules to manipulate inequalities.

Nevertheless  we  can  invent  a relation  for  sorting:  arranging  (sorting)  the  complex  numbers  according  to 
their  absolute  value  (modulus)  leaves  infinitely  many  numbers  in  one  ‘bucket’,  namely  all  those  that  have 
the  same  distance  from  zero.  However,  one  could  use  the  modulus  as  the  major  ordering  parameter,  the 
argument  (angle)  as  the  minor.  Or  the  real  part  as  the  major  and  the  imaginary  part  as  the  minor.  The 
latter  is  realized  in 

    static inline int
    cmp_complex(const Complex &f, const Complex &g)
    {
        const double fr = f.real(),  gr = g.real();
        if ( fr!=gr )  return (fr>gr ? +1 : -1);

        const double fi = f.imag(),  gi = g.imag();
        if ( fi!=gi )  return (fi>gi ? +1 : -1);

        return 0;
    }

This function, when used as comparison with the following routine, can indeed be the practical tool you had in mind:

    void complex_sort(Complex *f, ulong n)
    // major order wrt. real part
    // minor order wrt. imag part
    {
        quick_sort(f, n, cmp_complex);
    }
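The same ordering can be tried out with the standard library alone. The sketch below assumes that Complex is std::complex<double>. The names complex_less() and complex_sort_std() are hypothetical, and std::sort() replaces the quick_sort() routine used above. Note that std::sort() expects a strict 'less than' predicate rather than a three-way comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <vector>

typedef std::complex<double> Complex;  // assumption: Complex is std::complex<double>

// Strict "less than" variant of the comparison:
// major order by real part, minor order by imaginary part.
inline bool complex_less(const Complex &f, const Complex &g)
{
    if ( f.real() != g.real() )  return f.real() < g.real();
    return f.imag() < g.imag();
}

inline void complex_sort_std(std::vector<Complex> &f)
{
    std::sort(f.begin(), f.end(), complex_less);
}
```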

3.3.3.2  Index and pointer sorting

The index sorting routines that use a supplied comparison function are given in [FXT: sort/sortidxfunc.h]:

    template <typename Type>
    void idx_selection_sort(const Type *f, ulong n, ulong *x,
                            int (*cmp)(const Type &, const Type &))
    // Sort x[] so that the sequence f[x[0]], f[x[1]], ... f[x[n-1]]
    // is ascending with respect to comparison function cmp().
    {
        for (ulong i=0; i<n; ++i)
        {
            Type v = f[x[i]];
            ulong m = i;  // position-ptr of minimum
            ulong j = n;
            while ( --j > i )  // search (index of) minimum
            {
                if ( cmp(f[x[j]], v) < 0 )
                {
                    m = j;
                    v = f[x[m]];
                }
            }
            swap2(x[i], x[m]);
        }
    }

The verification routine is:

    template <typename Type>
    bool is_idx_sorted(const Type *f, ulong n, const ulong *x,
                       int (*cmp)(const Type &, const Type &))
    // Return whether the sequence f[x[0]], f[x[1]], ... f[x[n-1]] is ascending
    // with respect to comparison function cmp().
    {
        for (ulong k=1; k<n; ++k)  if ( cmp(f[x[k-1]], f[x[k]]) > 0 )  return false;
        return true;
    }

The pointer sorting versions are given in [FXT: sort/sortptrfunc.h]:

    template <typename Type>
    void ptr_selection_sort(/*const Type *f,*/ ulong n, const Type **x,
                            int (*cmp)(const Type &, const Type &))
    // Sort x[] so that the sequence *x[0], *x[1], ..., *x[n-1]
    // is ascending with respect to comparison function cmp().
    {
        for (ulong i=0; i<n; ++i)
        {
            Type v = *x[i];
            ulong m = i;  // position-ptr of minimum
            ulong j = n;
            while ( --j > i )  // search (index of) minimum
            {
                if ( cmp(*x[j], v) < 0 )
                {
                    m = j;
                    v = *x[m];
                }
            }
            swap2(x[i], x[m]);
        }
    }

The verification routine is:

    template <typename Type>
    bool is_ptr_sorted(/*const Type *f,*/ ulong n, Type const*const*x,
                       int (*cmp)(const Type &, const Type &))
    // Return whether the sequence *x[0], *x[1], ..., *x[n-1]
    // is ascending with respect to comparison function cmp().
    {
        for (ulong k=1; k<n; ++k)  if ( cmp(*x[k-1], *x[k]) > 0 )  return false;
        return true;
    }

The corresponding versions of the binary search algorithm are given in [FXT: sort/bsearchidxfunc.h] and [FXT: sort/bsearchptrfunc.h].


3.4  Searching  in  unsorted  arrays 


To find the first occurrence of a certain value in an unsorted array use the routine [FXT: sort/usearch.h]:

    template <typename Type>
    inline ulong first_eq_idx(const Type *f, ulong n, Type v)
    // Return index of first element == v
    // Return n if all !=v
    {
        ulong k = 0;
        while ( (k<n) && (f[k]!=v) )  k++;
        return k;
    }

The functions first_neq_idx(), first_geq_idx() and first_leq_idx() find the first occurrence of an element unequal (to v), greater than or equal and less than or equal, respectively.

If the last bit of speed matters, one could use a sentinel, as suggested in [210, p.267]:

    template <typename Type>
    inline ulong first_eq_idx(/* NOT const */ Type *f, ulong n, Type v)
    {
        Type s = f[n-1];
        f[n-1] = v;  // sentinel to guarantee that the search stops
        ulong k = 0;
        while ( f[k]!=v )  ++k;
        f[n-1] = s;  // restore value
        if ( (k==n-1) && (v!=s) )  ++k;
        return k;
    }

There is only one branch in the inner loop, which can give a significant speedup. However, the technique is only applicable if writing to the array 'f[]' is allowed. Note also that the routine reads f[n-1] and therefore must not be called with n = 0.

Another way to optimize the search is partial unrolling of the loop:

    template <typename Type>
    inline ulong first_eq_idx_large(const Type *f, ulong n, Type v)
    {
        ulong k;
        for (k=0; k<(n&3); ++k)  if ( f[k]==v )  return k;

        while ( k!=n )  // 4-fold unrolled
        {
            Type t0 = f[k],  t1 = f[k+1],  t2 = f[k+2],  t3 = f[k+3];
            bool qa = ( (t0==v) | (t1==v) );  // note bit-wise OR to avoid branch
            bool qb = ( (t2==v) | (t3==v) );
            if ( qa | qb )  // element v found
            {
                while ( 1 )  { if ( f[k]==v )  return k;  else  ++k; }
            }
            k += 4;
        }
        return n;
    }

The search requires only two branches with every four elements. By using two variables qa and qb better usage of the CPU internal parallelism is attempted. Depending on the data type and CPU architecture 8-fold unrolling may give a speedup.
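When experimenting with such optimizations it helps to check the optimized search against the straightforward one. The following sketch does this for a small fixed array. The names search_plain(), search_unrolled() and searches_agree() are hypothetical, and the unrolled routine is a slightly condensed restatement of the scheme above.

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned long ulong;

// Straightforward linear search: index of first element == v, else n.
template <typename Type>
ulong search_plain(const Type *f, ulong n, Type v)
{
    ulong k = 0;
    while ( (k < n) && (f[k] != v) )  ++k;
    return k;
}

// 4-fold unrolled variant following the scheme in the text.
template <typename Type>
ulong search_unrolled(const Type *f, ulong n, Type v)
{
    ulong k;
    for (k = 0; k < (n & 3); ++k)  if ( f[k] == v )  return k;
    while ( k != n )  // n-k is a multiple of 4 here
    {
        bool qa = ( (f[k] == v) | (f[k+1] == v) );  // bit-wise OR, no branch
        bool qb = ( (f[k+2] == v) | (f[k+3] == v) );
        if ( qa | qb )
        {
            while ( f[k] != v )  ++k;  // at most 3 steps
            return k;
        }
        k += 4;
    }
    return n;
}

// Compare both searches for every prefix length of a fixed test array
// and every value 0..9 (values 0 and 8 do not occur in the array).
inline bool searches_agree()
{
    const int f[] = {7, 3, 9, 3, 1, 4, 1, 5, 9, 2, 6};
    for (ulong n = 0; n <= 11; ++n)
        for (int v = 0; v < 10; ++v)
            if ( search_plain(f, n, v) != search_unrolled(f, n, v) )  return false;
    return true;
}
```

Such a check says nothing about speed. Timings should be taken with the actual data type and array sizes of the application.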

3.5  Determination  of  equivalence  classes 

Let $S$ be a set and $C := S \times S$ the set of all ordered pairs $(x, y)$ with $x, y \in S$. A binary relation $R$ on $S$ is a subset of $C$. An equivalence relation is a binary relation with the following properties:

• reflexive: $x \equiv x \;\; \forall x$.

• symmetric: $x \equiv y \iff y \equiv x \;\; \forall x, y$.

• transitive: $x \equiv y, \; y \equiv z \implies x \equiv z \;\; \forall x, y, z$.

Here we wrote $x \equiv y$ for $(x, y) \in R$ where $x, y \in S$.

We want to determine the equivalence classes: an equivalence relation partitions a set of $n$ elements into $1 \le q \le n$ subsets $E_1, E_2, \ldots, E_q$ so that $x \equiv y$ whenever both $x$ and $y$ are in the same subset but $x \not\equiv y$ if $x$ and $y$ are in different subsets.

For example, the usual equality relation is an equivalence relation: with a set of (different) numbers each number is in its own class. With the equivalence relation that $x \equiv y$ whenever $x - y$ is a multiple of some fixed integer $m > 0$ and the set $\mathbb{Z}$ of all integers we obtain $m$ subsets and $x \equiv y$ if and only if $x \equiv y \bmod m$.

3.5.1  Algorithm  for  decomposition  into  equivalence  classes 

Let $S$ be a set of $n$ elements, represented as a vector. On termination of the following algorithm $Q_k = j$ if $j$ is the least index such that $S_j \equiv S_k$ (note that we consider the elements of $S$ to be in a fixed but arbitrary order here):

1. Put each element in its own equivalence class: $Q_k := k$ for all $0 \le k < n$.

2. Set $k := 1$ (index of the second element).

3. (Search for an equivalent element:)

   (a) Set $j := 0$.

   (b) If $S_k \equiv S_j$ set $Q_k = Q_j$ and goto step 4.

   (c) Set $j := j + 1$ and goto step 3b.

4. Set $k := k + 1$ and if $k < n$ goto step 3, else terminate.

The algorithm needs $n - 1$ equivalence tests when all elements are in the same equivalence class and $n\,(n-1)/2$ equivalence tests when each element is alone in its own equivalence class.

In the following implementation the equivalence relation must be supplied as a function equiv_q() that returns true when its arguments are equivalent [FXT: sort/equivclasses.h]:

    template <typename Type>
    void equivalence_classes(const Type *s, ulong n, bool (*equiv_q)(Type, Type), ulong *q)
    // Given an equivalence relation '==' (as function equiv_q())
    // and a set s[] with n elements,
    // write to q[k] the index j of the first element s[j] such that s[k]==s[j].
    {
        for (ulong k=0; k<n; ++k)  q[k] = k;  // each in own class
        for (ulong k=1; k<n; ++k)
        {
            ulong j = 0;
            while ( ! equiv_q(s[j], s[k]) )  ++j;
            q[k] = q[j];
        }
    }
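As a small usage sketch, the decomposition can be applied to integers with the relation 'difference is a multiple of 3'. The names equivalence_classes_sketch() and equiv_mod3() are hypothetical. The routine is a restatement with std::vector that assigns q[k] = j directly, which gives the same result because the first equivalent element is always its own representative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// q[k] becomes the least index j such that s[j] is equivalent to s[k].
// The relation is passed as any callable equiv(a, b) returning bool.
template <typename Type, typename Equiv>
std::vector<std::size_t>
equivalence_classes_sketch(const std::vector<Type> &s, Equiv equiv)
{
    std::vector<std::size_t> q(s.size());
    for (std::size_t k = 0; k < s.size(); ++k)
    {
        std::size_t j = 0;
        // The search stops at j==k at the latest (reflexivity).
        while ( !equiv(s[j], s[k]) )  ++j;
        q[k] = j;
    }
    return q;
}

// Example relation: a and b are equivalent if a - b is a multiple of 3.
inline bool equiv_mod3(int a, int b)  { return (a - b) % 3 == 0; }
```

The routine relies on equiv() really being an equivalence relation. With a relation that is not reflexive the inner loop could run past the end of the array.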


3.5.2  Examples  of  equivalence  classes 

3.5.2.1  Integers modulo m

Choose an integer $m > 1$ and let any two integers $a$ and $b$ be equivalent if $a - b$ is an integer multiple of $m$ (with $m = 1$ all integers are in the same class). We can choose the numbers $0, 1, \ldots, m - 1$ as representatives of the $m$ classes obtained. Now we can do computations with those classes via the modular arithmetic as described in section 39.1 on page 764. This is easily the most important example of all equivalence relations.

The concept also makes sense for a real (non-integral) modulus $m > 0$. We still put two numbers $a$ and $b$ into the same class if $a - b$ is an integer multiple of $m$. Finally, the modulus $m = 0$ leads to the equivalence relation 'equality'.


3.5.2.2  Binary necklaces


Consider the set $S$ of $n$-bit binary words with the equivalence relation in which two words $x$ and $y$ are equivalent if and only if there is a cyclic shift $h_k(x)$ by $0 \le k < n$ positions such that $h_k(x) = y$. The equivalence relation is supplied as the function [FXT: sort/equivclass-necklaces-demo.cc]:

    static ulong nb;  // number of bits
    bool n_equiv_q(ulong x, ulong y)  // necklaces
    {
        ulong d = bit_cyclic_dist(x, y, nb);
        return (0==d);
    }

The function bit_cyclic_dist() is given in section 1.13.4 on page 32. For n = 4 we find the following list of equivalence classes:

     0:  ....                    [#=1]
     1:  1...  .1..  ...1  ..1.  [#=4]
     3:  1..1  11..  ..11  .11.  [#=4]
     5:  .1.1  1.1.              [#=2]
     7:  11.1  111.  1.11  .111  [#=4]
    15:  1111                    [#=1]
    # of equivalence classes = 6

These correspond to the binary necklaces of length 4. One usually chooses the cyclic minima (or maxima) among equivalent words as representatives of the classes.


3.5.2.3  Unlabeled binary necklaces


Same set but the equivalence relation is defined to identify two words $x$ and $y$ when there is a cyclic shift $h_k(x)$ by $0 \le k < n$ positions so that either $h_k(x) = y$ or $h_k(x) = \bar{y}$ where $\bar{y}$ is the complement of $y$:

    static ulong mm;  // mask to complement
    bool nu_equiv_q(ulong x, ulong y)  // unlabeled necklaces
    {
        ulong d = bit_cyclic_dist(x, y, nb);
        if ( 0!=d )  d = bit_cyclic_dist(mm^x, y, nb);
        return (0==d);
    }

With n = 4 we find

     0:  1111  ....                                          [#=2]
     1:  111.  11.1  1.11  1...  .111  ...1  ..1.  .1..      [#=8]
     3:  .11.  1..1  11..  ..11                              [#=4]
     5:  .1.1  1.1.                                          [#=2]
    # of equivalence classes = 4

These correspond to the unlabeled binary necklaces of length 4.


3.5.2.4  Binary bracelets


The binary bracelets are obtained by identifying two words that are identical up to rotation and possible reversal. The corresponding comparison function is

    bool b_equiv_q(ulong x, ulong y)  // bracelets
    {
        ulong d = bit_cyclic_dist(x, y, b);
        if ( 0!=d )  d = bit_cyclic_dist(revbin(x,b), y, b);
        return (0==d);
    }

There are six binary bracelets of length 4:

     0:  ....                    [#=1]
     1:  1...  .1..  ...1  ..1.  [#=4]
     3:  1..1  11..  ..11  .11.  [#=4]
     5:  .1.1  1.1.              [#=2]
     7:  11.1  111.  1.11  .111  [#=4]
    15:  1111                    [#=1]

The unlabeled binary bracelets are obtained by additionally allowing for bit-wise complementation:

    bool bu_equiv_q(ulong x, ulong y)  // unlabeled bracelets
    {
        ulong d = bit_cyclic_dist(x, y, b);
        x ^= mm;
        if ( 0!=d )  d = bit_cyclic_dist(x, y, b);

        x = revbin(x,b);
        if ( 0!=d )  d = bit_cyclic_dist(x, y, b);
        x ^= mm;
        if ( 0!=d )  d = bit_cyclic_dist(x, y, b);

        return (0==d);
    }

There are four unlabeled binary bracelets of length 4:

     0:  1111  ....                                          [#=2]
     1:  111.  11.1  1.11  1...  .111  ...1  ..1.  .1..      [#=8]
     3:  .11.  1..1  11..  ..11                              [#=4]
     5:  .1.1  1.1.                                          [#=2]

The shown functions are given in [FXT: sort/equivclass-bracelets-demo.cc] which can be used to produce listings of the equivalence classes.


The sequences of numbers of labeled and unlabeled necklaces and bracelets are shown in figure 3.5-A.

         n:        N        B      N/U      B/U
    [312]#  A000031  A000029  A000013  A000011
         1:        2        2        1        1
         2:        3        3        2        2
         3:        4        4        2        2
         4:        6        6        4        4
         5:        8        8        4        4
         6:       14       13        8        8
         7:       20       18       10        9
         8:       36       30       20       18
         9:       60       46       30       23
        10:      108       78       56       44
        11:      188      126       94       63
        12:      352      224      180      122
        13:      632      380      316      190
        14:     1182      687      596      362
        15:     2192     1224     1096      612

Figure 3.5-A: The number of binary necklaces 'N', bracelets 'B', unlabeled necklaces 'N/U', and unlabeled bracelets 'B/U'. The second row gives the sequence number in [312].
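For small word lengths these numbers can be checked by brute force: a class is counted exactly once if we count only those words that are minimal among all their images under the allowed operations. The sketch below does this. The function names rot1(), rev() and count_classes() are hypothetical, and the approach is chosen for clarity rather than speed (it needs about $n \cdot 2^n$ steps).

```cpp
#include <cassert>

typedef unsigned long ulong;

// Rotate the n-bit word x left by one position (n >= 1).
inline ulong rot1(ulong x, ulong n)
{
    const ulong mask = (1UL << n) - 1;
    return ((x << 1) | (x >> (n - 1))) & mask;
}

// Reverse the n-bit word x.
inline ulong rev(ulong x, ulong n)
{
    ulong r = 0;
    for (ulong k = 0; k < n; ++k)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Count classes of n-bit words under rotation, optionally also
// under reversal (bracelets) and complement (unlabeled).
// A class is counted at its minimal member.
inline ulong count_classes(ulong n, bool reversal, bool complement)
{
    const ulong mask = (1UL << n) - 1;
    ulong ct = 0;
    for (ulong x = 0; x <= mask; ++x)
    {
        bool is_min = true;
        ulong y = x;
        for (ulong k = 0; k < n; ++k)  // y runs through all rotations of x
        {
            y = rot1(y, n);
            const ulong yr = rev(y, n);
            if ( y < x )  is_min = false;
            if ( complement && ((y ^ mask) < x) )  is_min = false;
            if ( reversal && (yr < x) )  is_min = false;
            if ( reversal && complement && ((yr ^ mask) < x) )  is_min = false;
        }
        ct += is_min;
    }
    return ct;
}
```

For example, count_classes(4, false, false) counts the necklaces of length 4 and count_classes(4, true, true) the unlabeled bracelets.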


3.5.2.5  Binary words with reversal and complement

Consider the set $S$ of $n$-bit binary words with the equivalence relation identifying two words $x$ and $y$ whenever they are mutual complements or bit-wise reversals.

Figure 3.5-B: Equivalence classes of binary words where words are identified if either their reversals or complements are equal.

For example, the equivalence classes with 3-, 4- and 5-bit words are shown in figure 3.5-B. The sequence of numbers of equivalence classes for word-sizes n is (entry A005418 in [312])

    n:  1,  2,  3,  4,  5,  6,  7,  8,  9,  10,  11,  12,  13,  14,  15,  16,  ...

    #:  1,  2,  3,  6,  10,  20,  36,  72,  136,  272,  528,  1056,  2080,  4160,  8256,  16512,  ...

The equivalence classes can be computed with the program [FXT: sort/equivclass-bitstring-demo.cc].

We  have  chosen  examples  where  the  resulting  equivalence  classes  can  be  verified  by  inspection.  For 
example,  we  could  create  the  subsets  of  equivalent  necklaces  by  simply  rotating  a given  word  and  marking 
the  words  visited  so  far.  Such  an  approach,  however,  is  not  possible  if  the  equivalence  relation  does  not 
have  an  obvious  structure. 


3.5.3  The  number  of  equivalence  relations  for  a set  of  n elements 

We write $B(n)$ for the number of possible partitionings (and thereby equivalence relations) of the set $\{1, 2, \ldots, n\}$. These are called Bell numbers. The sequence of Bell numbers is entry A000110 in [312], it starts as ($n \ge 1$):

    1,  2,  5,  15,  52,  203,  877,  4140,  21147,  115975,  678570,  4213597,  ...

They can be computed easily as indicated in the following table:

    0:  [ 1]
    1:  [ 1,  2]
    2:  [ 2,  3,  5]
    3:  [ 5,  7,  10,  15]
    4:  [15,  20,  27,  37,  52]
    5:  [52,  67,  87,  114,  151,  203]
    n:  [B(n), ...]

The first element in each row is the last element of the previous row, the remaining elements are the sum of their left and upper left neighbors. As GP code:

    N=7;  v=w=b=vector(N);  v[1]=1;
    { for(n=1, N-1,
        b[n] = v[1];
        print(n-1, ": ", v);  \\ print row
        w[1] = v[n];
        for(k=2, n+1, w[k]=w[k-1]+v[k-1]);
        v=w;
      ); }

An implementation in C++ is given in [FXT: comb/bell-number-demo.cc]. An alternative way to compute the Bell numbers is shown in section 17.2 on page 358.
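The triangle scheme translates directly into C++. The following is a minimal sketch and not the FXT program mentioned above. The function name bell_numbers() is hypothetical. The values overflow an unsigned long for large N, so the routine is only meant for small arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Return B(1), B(2), ..., B(N) computed with the triangle scheme:
// each row starts with the last entry of the previous row, every further
// entry is the sum of its left and upper-left neighbors.
inline std::vector<ulong> bell_numbers(std::size_t N)
{
    std::vector<ulong> b;
    std::vector<ulong> v(1, 1);  // row 0: [1]
    for (std::size_t n = 1; n <= N; ++n)
    {
        b.push_back(v.back());  // last entry of row n-1 is B(n)
        std::vector<ulong> w(n + 1);  // row n has n+1 entries
        w[0] = v.back();
        for (std::size_t k = 1; k <= n; ++k)  w[k] = w[k-1] + v[k-1];
        v = w;
    }
    return b;
}
```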
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Data  structures 


We  give  implementations  of  selected  data  structures  like  stack,  ring  buffer,  queue,  double-ended  queue 
(deque),  bit-array,  heap  and  priority  queue. 

4.1  Stack  (LIFO) 
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Figure  4.1-A:  Inserting  and  retrieving  elements  with  a stack. 

A stack (or LIFO, for last-in, first-out) is a data structure that supports the operations: push() to save an entry, pop() to retrieve and remove the entry that was entered last, and peek() to retrieve the element that was entered last without removing it. The method poke() modifies the last entry. An implementation with the option to let the stack grow when necessary is [FXT: class stack in ds/stack.h]:

    template <typename Type>
    class stack
    {
    public:
        Type *x_;   // data
        ulong s_;   // size
        ulong p_;   // stack pointer (position of next write), top entry @ p-1
        ulong gq_;  // grow gq elements if necessary, 0 for "never grow"

    public:
        stack(ulong n, ulong growq=0)
        {
            s_ = n;
            x_ = new Type[s_];
            p_ = 0;  // stack is empty
            gq_ = growq;
        }

        ~stack()  { delete [] x_; }

        ulong num() const  { return p_; }  // Return number of entries.
Insertion and retrieval from the top of the stack are implemented as follows:

        ulong push(Type z)
        // Add element z on top of stack.
        // Return size of stack, zero on stack overflow.
        // If gq_ is nonzero the stack grows if needed.
        {
            if ( p_ >= s_ )
            {
                if ( 0==gq_ )  return 0;  // overflow
                grow();
            }
            x_[p_] = z;
            ++p_;
            return s_;
        }

        ulong pop(Type &z)
        // Retrieve top entry and remove it.
        // Return number of entries before removing element.
        // If empty return zero and leave z undefined.
        {
            ulong ret = p_;
            if ( 0!=p_ )  { --p_;  z = x_[p_]; }
            return ret;
        }

        ulong poke(Type z)
        // Modify top entry.
        // Return number of entries.
        // If empty return zero and do nothing.
        {
            if ( 0!=p_ )  x_[p_-1] = z;
            return p_;
        }

        ulong peek(Type &z)
        // Read top entry, without removing it.
        // Return number of entries.
        // If empty return zero and leave z undefined.
        {
            if ( 0!=p_ )  z = x_[p_-1];
            return p_;
        }


The growth routine is implemented as

    private:
        void grow()
        {
            ulong ns = s_ + gq_;  // new size
            x_ = ReAlloc<Type>(x_, ns, s_);
            s_ = ns;
        }
    };

Here we use the function ReAlloc() that imports the C function realloc().

    % man realloc

    #include <stdlib.h>

    void *realloc(void *ptr, size_t size);

    realloc() changes the size of the memory block pointed to by ptr to size
    bytes. The contents will be unchanged to the minimum of the old and new
    sizes; newly allocated memory will be uninitialized. If ptr is NULL, the
    call is equivalent to malloc(size); if size is equal to zero, the call is
    equivalent to free(ptr). Unless ptr is NULL, it must have been returned by
    an earlier call to malloc(), calloc() or realloc().
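The growth in fixed increments can be sketched without manual memory management. The class below is a hypothetical illustration, not the FXT class: the name grow_stack and the use of std::vector::resize() in place of ReAlloc() are assumptions made to keep the example self-contained. Only push() and pop() are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a stack that grows in steps of gq elements,
// with std::vector doing the reallocation.
template <typename Type>
class grow_stack
{
    std::vector<Type> x_;  // data
    std::size_t p_;        // position of next write, top entry at p_-1
    std::size_t gq_;       // growth increment, 0 for "never grow"
public:
    explicit grow_stack(std::size_t n, std::size_t growq = 0)
        : x_(n), p_(0), gq_(growq)  {}

    std::size_t num() const  { return p_; }          // number of entries
    std::size_t size() const  { return x_.size(); }  // allocated size

    // Return allocated size after the push, zero on overflow.
    std::size_t push(const Type &z)
    {
        if ( p_ >= x_.size() )
        {
            if ( 0 == gq_ )  return 0;   // overflow, growing disabled
            x_.resize(x_.size() + gq_);  // grow by gq_ elements
        }
        x_[p_++] = z;
        return x_.size();
    }

    // Return number of entries before the pop (zero if empty).
    std::size_t pop(Type &z)
    {
        const std::size_t ret = p_;
        if ( 0 != p_ )  z = x_[--p_];
        return ret;
    }
};
```

With an initial size of 4 and an increment of 4, the fifth push() triggers one growth step and the allocated size becomes 8.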

A program that shows the working of the stack is [FXT: ds/stack-demo.cc]. An example output where the initial size is 4 and the growth-feature enabled (in increments of 4 elements) is shown in figure 4.1-A.


4.2  Ring buffer

A ring buffer is an array together with read and write operations that wrap around. That is, when the last position of the array is reached, writing continues at the beginning of the array, thereby erasing the oldest entries. The read operation starts at the oldest entry in the array.

                  array x[]     x[] ordered    n  wpos  fpos
    insert(A)     A             A              1    1     0
    insert(B)     A B           A B            2    2     0
    insert(C)     A B C         A B C          3    3     0
    insert(D)     A B C D       A B C D        4    0     0
    insert(E)     E B C D       B C D E        4    1     1
    insert(F)     E F C D       C D E F        4    2     2
    insert(G)     E F G D       D E F G        4    3     3
    insert(H)     E F G H       E F G H        4    0     0
    insert(I)     I F G H       F G H I        4    1     1
    insert(J)     I J G H       G H I J        4    2     2

Figure 4.2-A: Writing to a ring buffer.


Figure 4.2-A shows the contents of a length-4 ring buffer after insertion of the symbols 'A', 'B', ..., 'J'. The listing was created with the program [FXT: ds/ringbuffer-demo.cc]. The implementation used is [FXT: class ringbuffer in ds/ringbuffer.h]:

    template <typename Type>
    class ringbuffer
    {
    public:
        Type *x_;      // data (ring buffer)
        ulong s_;      // allocated size (# of elements)
        ulong n_;      // current number of entries in buffer
        ulong wpos_;   // next position to write in buffer
        ulong fpos_;   // first position to read in buffer

    public:
        ringbuffer(ulong n)
        {
            s_ = n;
            x_ = new Type[s_];
            n_ = 0;
            wpos_ = 0;
            fpos_ = 0;
        }

        ~ringbuffer()  { delete [] x_; }

        ulong num() const  { return n_; }

If an entry is inserted, it is written to index wpos:

        void insert(const Type &z)
        {
            x_[wpos_] = z;
            if ( ++wpos_>=s_ )  wpos_ = 0;
            if ( n_ < s_ )  ++n_;
            else  fpos_ = wpos_;
        }

        ulong read(ulong k, Type &z) const
        // Read entry k (that is, [(fpos_ + k)%s_]).
        // Return 0 if k>=n, else return k+1.
        {
            if ( k>=n_ )  return 0;
            ulong j = fpos_ + k;
            if ( j>=s_ )  j -= s_;
            z = x_[j];
            return k + 1;
        }
    };

Ring  buffers  are,  for  example,  useful  for  logging  purposes,  if  only  a certain  number  of  lines  can  be  saved. 
To  do  so,  enhance  the  ringbuffer  class  so  that  it  uses  an  additional  array  of  (fixed  width)  strings.  The 
message  to  log  is  copied  into  the  array  and  the  pointer  set  accordingly.  A read  returns  the  pointer  to  the 
string. 
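The bookkeeping of the write and read positions can be tried out with a small stand-alone sketch. The class char_ring below is hypothetical and stores single characters in a std::vector. It follows the same update rules for n, wpos and fpos as described above, so its state can be compared with the figure.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of a ring buffer of chars with the same
// bookkeeping (n, wpos, fpos) as described in the text.
class char_ring
{
    std::vector<char> x_;          // data (ring buffer)
    std::size_t n_, wpos_, fpos_;  // entries, write position, first read position
public:
    explicit char_ring(std::size_t s) : x_(s), n_(0), wpos_(0), fpos_(0)  {}

    void insert(char z)
    {
        x_[wpos_] = z;
        if ( ++wpos_ >= x_.size() )  wpos_ = 0;  // wrap around
        if ( n_ < x_.size() )  ++n_;
        else  fpos_ = wpos_;  // oldest entry was overwritten
    }

    // Entries from oldest to newest.
    std::string ordered() const
    {
        std::string r;
        for (std::size_t k = 0; k < n_; ++k)  r += x_[(fpos_ + k) % x_.size()];
        return r;
    }

    std::size_t wpos() const  { return wpos_; }
    std::size_t fpos() const  { return fpos_; }
};
```

After inserting 'A' through 'J' into a buffer of size 4, ordered() gives the four most recent symbols in order of insertion.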

4.3  Queue  (FIFO) 

A queue (or FIFO for first-in, first-out) is a data structure that supports the following operations: push() saves an entry, pop() retrieves (and removes) the entry that was entered least recently, and peek() retrieves the least recently entered element without removing it.

Figure 4.3-A: Inserting and retrieving elements with a queue.


We describe a queue with an optional feature of growing when necessary. Figure 4.3-A shows the data for a queue where the initial size is four and the growth-feature enabled (in steps of four elements). The listing was created with the program [FXT: ds/queue-demo.cc].

The implementation is [FXT: class queue in ds/queue.h]:

    template <typename Type>
    class queue
    {
    public:
        Type *x_;      // pointer to data
        ulong s_;      // allocated size (# of elements)
        ulong n_;      // current number of entries in buffer
        ulong wpos_;   // next position to write in buffer
        ulong rpos_;   // next position to read in buffer
        ulong gq_;     // grow gq elements if necessary, 0 for "never grow"

    public:
        explicit queue(ulong n, ulong growq=0)
        {
            s_ = n;
            x_ = new Type[s_];
            n_ = 0;
            wpos_ = 0;
            rpos_ = 0;
            gq_ = growq;
        }

        ~queue()  { delete [] x_; }

        ulong num() const  { return n_; }


The method push() writes to x[wpos], peek() and pop() read from x[rpos]:

        ulong push(const Type &z)
        // Return number of entries.
        // Zero is returned on failure
        //   (i.e. space exhausted and 0==gq_)
        {
            if ( n_ >= s_ )
            {
                if ( 0==gq_ )  return 0;  // growing disabled
                grow();
            }

            x_[wpos_] = z;
            ++wpos_;
            if ( wpos_>=s_ )  wpos_ = 0;

            ++n_;
            return n_;
        }

        ulong peek(Type &z)
        // Return number of entries.
        // If zero is returned the value of z is undefined.
        {
            z = x_[rpos_];
            return n_;
        }

        ulong pop(Type &z)
        // Return number of entries before pop
        // i.e. zero is returned if queue was empty.
        // If zero is returned the value of z is undefined.
        {
            ulong ret = n_;
            if ( 0!=n_ )
            {
                z = x_[rpos_];
                ++rpos_;
                if ( rpos_ >= s_ )  rpos_ = 0;
                --n_;
            }

            return ret;
        }

The growing feature is implemented as follows:

    private:
        void grow()
        {
            ulong ns = s_ + gq_;  // new size
            // move read-position to zero:
            rotate_left(x_, s_, rpos_);
            x_ = ReAlloc<Type>(x_, ns, s_);
            wpos_ = s_;
            rpos_ = 0;
            s_ = ns;
        }
    };
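The reason for the rotation is that a full circular buffer in general has its oldest entry somewhere in the middle. Simply appending free space would then split the data. The sketch below illustrates the growth step. The function name grow_full_queue() is hypothetical, std::rotate() plays the role of rotate_left(), and std::vector::resize() the role of ReAlloc().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the growth step for a full circular queue:
// x holds s entries, the oldest at index rpos. After the call the
// oldest entry is at index 0 and gq unused slots follow the data.
template <typename Type>
void grow_full_queue(std::vector<Type> &x, std::size_t &rpos,
                     std::size_t &wpos, std::size_t gq)
{
    const std::size_t s = x.size();
    // Move read position to zero: element at rpos becomes the first one.
    std::rotate(x.begin(), x.begin() + rpos, x.end());
    x.resize(s + gq);
    rpos = 0;
    wpos = s;  // next write goes right after the old data
}
```

For a buffer holding E, F, C, D with the oldest entry C at index 2, the call leaves C, D, E, F at the start of the enlarged array.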


4.4  Deque  (double-ended  queue) 


A deque (for double-ended queue) combines the data structures stack and queue: insertion and deletion in time O(1) is possible both at the first and the last position. An implementation with the option to let the deque grow when necessary is [FXT: class deque in ds/deque.h]:

    template <typename Type>
    class deque
    {
    public:
        Type *x_;      // data (ring buffer)
        ulong s_;      // allocated size (# of elements)
        ulong n_;      // current number of entries in buffer
        ulong fpos_;   // position of first element in buffer
        // insert_first() will write to (fpos-1)%s
        ulong lpos_;   // position of last element in buffer plus one
        // insert_last() will write to lpos, n==(lpos-fpos) (mod s)
        // entries are at [fpos, ..., lpos-1] (range may be empty)

        ulong gq_;     // grow gq elements if necessary, 0 for "never grow"

    public:
        explicit deque(ulong n, ulong growq=0)
        {
            s_ = n;
            x_ = new Type[s_];
            n_ = 0;
            fpos_ = 0;
            lpos_ = 0;
            gq_ = growq;
        }

        ~deque()  { delete [] x_; }

        ulong num() const  { return n_; }

The insertion at the front and end are implemented as

        ulong insert_first(const Type &z)
        // Return number of entries after insertion.
        // Zero is returned on failure
        //   (i.e. space exhausted and 0==gq_)
        {
            if ( n_ >= s_ )
            {
                if ( 0==gq_ )  return 0;  // growing disabled
                grow();
            }

            --fpos_;
            if ( fpos_ == -1UL )  fpos_ = s_ - 1;
            x_[fpos_] = z;
            ++n_;
            return n_;
        }

        ulong insert_last(const Type &z)
        // Return number of entries after insertion.
        // Zero is returned on failure
        //   (i.e. space exhausted and 0==gq_)
        {
            if ( n_ >= s_ )
            {
                if ( 0==gq_ )  return 0;  // growing disabled
                grow();
            }

            x_[lpos_] = z;
            ++lpos_;
            if ( lpos_>=s_ )  lpos_ = 0;
            ++n_;
            return n_;
        }

The extraction methods are

        ulong extract_first(Type & z)
        // Return number of elements before extract.
        // Return 0 if extract on empty deque was attempted.
        {
            if ( 0==n_ )  return 0;
            z = x_[fpos_];
            ++fpos_;
            if ( fpos_ >= s_ )  fpos_ = 0;
            --n_;
            return n_ + 1;
        }

        ulong extract_last(Type & z)
        // Return number of elements before extract.
        // Return 0 if extract on empty deque was attempted.
        {
            if ( 0==n_ )  return 0;
            --lpos_;
            if ( lpos_ == -1UL )  lpos_ = s_ - 1;
            z = x_[lpos_];
            --n_;
            return n_ + 1;
        }

We can read at the front, end, or an arbitrary index, without changing any data:

    ulong read_first(Type & z) const
    // Read (but don't remove) first entry.
    // Return number of elements (i.e. on error return zero).
    {
        if ( 0==n_ ) return 0;
        z = x_[fpos_];
        return n_;
    }

    ulong read_last(Type & z) const
    // Read (but don't remove) last entry.
    // Return number of elements (i.e. on error return zero).
    {
        return read(n_-1, z);  // ok for n_==0
    }

    ulong read(ulong k, Type & z) const
    // Read entry k (that is, [(fpos_ + k)%s_]).
    // Return 0 if k>=n_ else return k+1
    {
        if ( k>=n_ ) return 0;
        ulong j = fpos_ + k;
        if ( j>=s_ ) j -= s_;
        z = x_[j];
        return k + 1;
    }

private:
    void grow()
    {
        ulong ns = s_ + gq_;  // new size
        // Move read-position to zero:
        rotate_left(x_, s_, fpos_);
        x_ = ReAlloc<Type>(x_, ns, s_);
        fpos_ = 0;
        lpos_ = n_;
        s_ = ns;
    }
};

Figure 4.4-A: Inserting and retrieving elements with a queue.


Its working is shown in figure 4.4-A, which was created with the program [FXT: ds/deque-demo.cc].
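
The wrap-around index arithmetic can be tried in isolation. The following is a minimal sketch, assuming a fixed capacity (no growth) and using `std::vector` for storage; the names `ring_deque`, `push_front`, and `push_back` are hypothetical and not part of FXT.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity ring-buffer deque (illustration only, no growth).
template <typename Type>
class ring_deque
{
    std::vector<Type> x_;   // storage, capacity fixed at construction
    std::size_t n_ = 0;     // current number of entries
    std::size_t fpos_ = 0;  // position of first entry
    std::size_t lpos_ = 0;  // position of last entry plus one (mod capacity)

public:
    explicit ring_deque(std::size_t s) : x_(s) {}

    std::size_t num() const { return n_; }

    // Insert at the front; return new number of entries, zero if full.
    std::size_t push_front(const Type &z)
    {
        if ( n_ >= x_.size() )  return 0;
        fpos_ = (fpos_ == 0 ? x_.size() - 1 : fpos_ - 1);  // step back, wrap around
        x_[fpos_] = z;
        return ++n_;
    }

    // Insert at the end; return new number of entries, zero if full.
    std::size_t push_back(const Type &z)
    {
        if ( n_ >= x_.size() )  return 0;
        x_[lpos_] = z;
        if ( ++lpos_ >= x_.size() )  lpos_ = 0;  // step forward, wrap around
        return ++n_;
    }

    // Read entry k counted from the front; return k+1, zero if k is out of range.
    std::size_t read(std::size_t k, Type &z) const
    {
        if ( k >= n_ )  return 0;
        std::size_t j = fpos_ + k;
        if ( j >= x_.size() )  j -= x_.size();
        z = x_[j];
        return k + 1;
    }
};
```

With capacity 4, two calls of `push_back()` followed by two calls of `push_front()` fill the buffer; the front entries then live at the high end of the storage array and `read()` maps logical positions back to them.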


4.5  Heap  and  priority  queue 

4.5.1  Indexing  scheme  for  binary  trees 


                      1:[...1]
          2:[..1.]                3:[..11]
    4:[.1..]    5:[.1.1]    6:[.11.]    7:[.111]
8:[1...] 9:[1..1]

Figure 4.5-A: Indexing a binary tree: the left child of node k is node 2k, the right child is node 2k + 1.


A one-based index array with n elements can be identified with a binary tree as shown in figure 4.5-A.
Node 1 is the root node. The left child of node k is node 2k and the right child is node 2k + 1. The
parent of node k is node $\lfloor k/2 \rfloor$.

We require that consecutive array indices 1, 2, ..., n are used. Therefore all nodes k where $k \le \lfloor n/2 \rfloor$
have at least one child.
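
As a small sketch, the indexing scheme translates into one shift per relation. The helper names below are hypothetical (they do not appear in FXT) and only serve to make the arithmetic explicit.

```cpp
// Hypothetical helpers for the one-based indexing scheme of figure 4.5-A.
typedef unsigned long ulong;

inline ulong heap_parent(ulong k) { return k >> 1; }        // floor(k/2)
inline ulong heap_left(ulong k)   { return k << 1; }        // 2k
inline ulong heap_right(ulong k)  { return (k << 1) + 1; }  // 2k+1

// Whether node k has at least one child in a tree with nodes 1, 2, ..., n.
inline bool heap_has_child(ulong k, ulong n) { return heap_left(k) <= n; }
```

For n = 9, node 4 has the children 8 and 9, while node 5 is already a leaf.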

4.5.2  The  binary  heap 

A binary heap is a binary tree of the form just described, where both children are less than or equal to
their parent. Figure 4.5-B shows an example of a heap with nine elements.

                  95
          91              84
      79      91      80      78
    76  71

as array: [ 95, 91, 84, 79, 91, 80, 78, 76, 71 ]

Figure 4.5-B: A heap with nine elements, the left or right child is never greater than the parent.

The following function determines whether a given array is a heap [FXT: ds/heap.h]:

template <typename Type>
ulong test_heap(const Type *x, ulong n)
// Return 0 if x[] has heap property
// else index of node found to be greater than its parent.
{
    const Type *p = x - 1;  // make one-based
    for (ulong k=n; k>1; --k)
    {
        ulong t = (k>>1);  // parent(k)
        if ( p[t]<p[k] ) return k-1;  // zero-based index, in {1, 2, ..., n-1}
    }
    return 0;  // has heap property
}

Let L = 2k and R = 2k + 1 be the left and right children of node k, respectively. Now assume that the
subtrees whose roots are L and R already have the heap property, but node k is less than either L or R.
We can restore the heap property between k, L, and R by swapping element k downwards (with L or R,
as needed). The process is repeated if necessary until the bottom of the tree is reached:

template <typename Type>
void heapify(Type *z, ulong n, ulong k)
// Data expected in z[1,2,...,n].
{
    ulong m = k;  // index of max of k, left(k), and right(k)

    const ulong l = (k<<1);  // left(k);
    if ( (l <= n) && (z[l] > z[k]) ) m = l;  // left child (exists and) greater than k

    const ulong r = (k<<1) + 1;  // right(k);
    if ( (r <= n) && (z[r] > z[m]) ) m = r;  // right child (ex. and) greater than max(k,l)

    if ( m != k )  // need to swap
    {
        swap2(z[k], z[m]);
        heapify(z, n, m);
    }
}

To reorder an array into a heap, we restore the heap property from the bottom up:

template <typename Type>
void build_heap(Type *x, ulong n)
// Reorder data to a heap.
// Data expected in x[0,1,...,n-1].
{
    Type *z = x - 1;  // make one-based
    ulong j = (n>>1);  // max index such that node has at least one child
    while ( j > 0 )
    {
        heapify(z, n, j);
        --j;
    }
}

The routine has complexity O(n). Let the height of node k be the maximal number of swaps that can
happen with heapify(k). There are fewer than n/2 elements of height 1, n/4 of height 2, n/8 of height 3,
and so on. Let W(n) be the maximal number of swaps with n elements; we have

$$W(n) \;<\; 1\,\frac{n}{2} + 2\,\frac{n}{4} + 3\,\frac{n}{8} + \ldots + \log_2(n)\cdot 1 \;<\; 2\,n \qquad (4.5\text{-}1)$$

So the complexity is indeed linear.

A new element can be inserted into a heap in O(log n) time by appending it and moving it towards the
root as necessary:

template <typename Type>
bool heap_insert(Type *x, ulong n, ulong s, Type t)
// With x[] a heap of current size n
// and max size s (i.e. space for s elements allocated),
// insert t and restore heap-property.
// Return true if successful, else (i.e. if space exhausted) false.
{
    if ( n >= s ) return false;  // no free slot left
    ++n;
    Type *x1 = x - 1;  // make one-based
    ulong j = n;
    while ( j > 1 )  // move towards root
    {
        ulong k = (j>>1);  // k==parent(j)
        if ( x1[k] >= t ) break;
        x1[j] = x1[k];
        j = k;
    }
    x1[j] = t;
    return true;
}

Similarly, the maximal element can be removed in time O(log n):

template <typename Type>
Type heap_extract_max(Type *x, ulong n)
// Return maximal element of heap and restore heap structure.
// Return value is undefined for 0==n.
{
    Type m = x[0];
    if ( 0 != n )
    {
        Type *x1 = x - 1;
        x1[1] = x1[n];
        --n;
        heapify(x1, n, 1);
    }
    return m;
}
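
The interplay of building a heap and repeatedly extracting the maximum can be checked with a small self-contained sketch. It assumes `std::vector` storage with explicit index shifts instead of the one-based pointer trick, and the names `sift_down`, `make_max_heap`, and `drain_descending` are chosen for this illustration only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sift the element at one-based index k down until the max-heap property holds
// for the first n elements of x (iterative variant of the recursive routine).
template <typename Type>
void sift_down(std::vector<Type> &x, std::size_t n, std::size_t k)
{
    while ( true )
    {
        std::size_t m = k;  // index of max of k, left(k), right(k)
        const std::size_t l = 2*k, r = 2*k + 1;
        if ( (l <= n) && (x[l-1] > x[m-1]) )  m = l;
        if ( (r <= n) && (x[r-1] > x[m-1]) )  m = r;
        if ( m == k )  return;
        std::swap(x[k-1], x[m-1]);
        k = m;
    }
}

// Reorder x into a max-heap, bottom up, starting at the last node with a child.
template <typename Type>
void make_max_heap(std::vector<Type> &x)
{
    for (std::size_t j = x.size()/2; j > 0; --j)  sift_down(x, x.size(), j);
}

// Build a heap from a copy of x, then remove the maximum until the heap is empty.
// The removed elements are returned, in descending order.
template <typename Type>
std::vector<Type> drain_descending(std::vector<Type> x)
{
    make_max_heap(x);
    std::vector<Type> out;
    std::size_t n = x.size();
    while ( n > 0 )
    {
        out.push_back(x[0]);   // current maximum
        x[0] = x[n-1];         // move last element to the root
        --n;
        sift_down(x, n, 1);    // restore the heap property
    }
    return out;
}
```

Draining a heap built from the nine values of figure 4.5-B in arbitrary order yields them sorted from largest to smallest, which is the idea behind heapsort.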

4.5.3  Priority  queue 

A priority queue is a data structure that supports insertion of an element and extraction of its maximal
element, both in time O(log(n)). A priority queue can be used to schedule an event for a certain time
and return the next pending event.

We use a binary heap to implement a priority queue. Two modifications seem appropriate: Firstly, replace
extract_max() by extract_next(), leaving it as a compile time option whether to extract the minimal
or the maximal element. We need to change the comparison operators at a few strategic places so that
the heap is built either with its maximal or its minimal element first [FXT: class priority_queue in
ds/priorityqueue.h]:

#if 1
// next() is the one with the smallest key
// i.e. extract_next() is extract_min()
#define _CMP_ <
#define _CMPEQ_ <=
#else
// next() is the one with the biggest key
// i.e. extract_next() is extract_max()
#define _CMP_ >
#define _CMPEQ_ >=
#endif


Secondly, augment the elements by an event description that can be freely defined:

template <typename Type1, typename Type2>
class priority_queue
{
public:
    Type1 *t1_;  // time: t1[1..s] one-based array!
    Type2 *e1_;  // events: e1[1..s] one-based array!
    ulong s_;    // allocated size (# of elements)
    ulong n_;    // current number of events
    ulong gq_;   // grow gq elements if necessary, 0 for "never grow"

public:
    priority_queue(ulong n, ulong growq=0)
    {
        s_ = n;
        t1_ = new Type1[s_] - 1;
        e1_ = new Type2[s_] - 1;
        n_ = 0;
        gq_ = growq;
    }

    ~priority_queue()
    {
        delete [] (t1_+1);
        delete [] (e1_+1);
    }
  [--snip--]


The extraction and insertion operations are

    bool extract_next(Type1 &t, Type2 &e)
    {
        if ( n_ == 0 ) return false;
        t = t1_[1];
        e = e1_[1];
        t1_[1] = t1_[n_];
        e1_[1] = e1_[n_];
        --n_;
        heapify(1);
        return true;
    }

    bool insert(const Type1 &t, const Type2 &e)
    // Insert event e at time t.
    // Return true if successful, else false (space exhausted and growth disabled).
    {
        if ( n_ >= s_ )
        {
            if ( 0==gq_ ) return false;  // growing disabled
            grow();
        }
        ++n_;
        ulong j = n_;
        while ( j > 1 )
        {
            ulong k = (j>>1);  // k==parent(j)
            if ( t1_[k] _CMPEQ_ t ) break;
            t1_[j] = t1_[k];  e1_[j] = e1_[k];
            j = k;
        }
        t1_[j] = t;
        e1_[j] = e;
        return true;
    }

    void reschedule_next(Type1 t)
    {
        t1_[1] = t;
        heapify(1);
    }

The member function reschedule_next() is more efficient than the sequence extract_next();
insert();, as it calls heapify() only once. The heapify() function is tail-recursive, so we make it
iterative:

private:
    void heapify(ulong k)
    {
        ulong m = k;
    hstart:
        ulong l = (k<<1);  // left(k);
        ulong r = l + 1;   // right(k);
        if ( (l <= n_) && (t1_[l] _CMP_ t1_[k]) ) m = l;
        if ( (r <= n_) && (t1_[r] _CMP_ t1_[m]) ) m = r;
        if ( m != k )
        {
            swap2(t1_[k], t1_[m]);  swap2(e1_[k], e1_[m]);
            // heapify(m);
            k = m;
            goto hstart;  // tail recursion
        }
    }


The second argument of the constructor determines the number of elements added in case of growth; it
is zero (growth disabled) by default.

private:
    void grow()
    {
        ulong ns = s_ + gq_;  // new size
        t1_ = ReAlloc<Type1>(t1_+1, ns, s_) - 1;
        e1_ = ReAlloc<Type2>(e1_+1, ns, s_) - 1;
        s_ = ns;
    }
};

The ReAlloc() routine is described in section 4.1 on page 153.


[Figure 4.5-C: Insertion of events labeled 'A', 'B', ..., 'J' scheduled for random times into a priority
queue (left) and subsequent extraction (right). Columns in both panels: #, event, time.]


The program [FXT: ds/priorityqueue-demo.cc] inserts events at random times 0 < t < 1, then extracts
all of them. It gives the output shown in figure 4.5-C. A more typical usage would intermix the insertions
and extractions.
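
For comparison, the same scheduling pattern can be sketched with the C++ standard library. This is not the FXT class: it uses `std::priority_queue` with `std::greater` so that the event with the smallest time is extracted first, and the names `event`, `event_queue`, and `drain` are assumptions made for this example.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<double, std::string> event;  // (time, description)

// Min-heap on the time: with std::greater, top() is the smallest element.
typedef std::priority_queue<event, std::vector<event>, std::greater<event> > event_queue;

// Extract all events in order of increasing time and return their descriptions.
inline std::vector<std::string> drain(event_queue &q)
{
    std::vector<std::string> order;
    while ( !q.empty() )
    {
        order.push_back(q.top().second);
        q.pop();
    }
    return order;
}
```

Pushing events A, B, C, D with times 0.84, 0.39, 0.78, 0.91 and then draining the queue returns them in the order B, C, A, D. The standard container has no direct counterpart of reschedule_next(); there one would pop and push.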


4.6  Bit-array 


The  use  of  bit-arrays  should  be  obvious:  an  array  of  tag  values  (like  ‘seen’  versus  ‘unseen’)  where  all 
standard  data  types  would  be  a waste  of  space.  Besides  reading  and  writing  individual  bits  one  should 
implement  a convenient  search  for  the  next  set  (or  cleared)  bit. 


The class [FXT: class bitarray in ds/bitarray.h] is used, for example, for lists of small primes [FXT:
mod/primes.cc], for in-place transposition routines [FXT: aux2/transpose.h] (see section 2.8 on page 122)
and several operations on permutations (see section 2.4 on page 109).


class bitarray
// Bit-array class mostly for use as memory saving array of Boolean values.
// Valid index is 0...nb_-1 (as usual in C arrays).
{
public:
    ulong *f_;    // bit bucket
    ulong n_;     // number of bits
    ulong nfw_;   // number of words where all bits are used, may be zero
    ulong mp_;    // mask for partially used word if there is one, else zero
    // (ones are at the positions of the _unused_ bits)
    bool myfq_;   // whether f[] was allocated by class
  [--snip--]


The constructor allocates memory by default. If the second argument is nonzero, it must point to an
accessible memory range:

    bitarray(ulong nbits, ulong *f=0)
    // nbits must be nonzero
    {
        ulong nw = ctor_core(nbits);
        if ( f!=0 )
        {
            f_ = (ulong *)f;
            myfq_ = false;
        }
        else
        {
            f_ = new ulong[nw];
            myfq_ = true;
        }
    }


The public methods are

    // operations on bit n:
    ulong test(ulong n) const     // Test whether n-th bit set
    void set(ulong n)             // Set n-th bit
    void clear(ulong n)           // Clear n-th bit
    void change(ulong n)          // Toggle n-th bit
    ulong test_set(ulong n)       // Test whether n-th bit is set and set it
    ulong test_clear(ulong n)     // Test whether n-th bit is set and clear it
    ulong test_change(ulong n)    // Test whether n-th bit is set and toggle it

    // Operations on all bits:
    void clear_all()              // Clear all bits
    void set_all()                // Set all bits
    int all_set_q() const;        // Return whether all bits are set
    int all_clear_q() const;      // Return whether all bits are clear

    // Scanning the array:
    // Note: the given index n is included in the search
    ulong next_set_idx(ulong n) const    // Return index of next set or value beyond end
    ulong next_clear_idx(ulong n) const  // Return index of next clear or value beyond end


Combined  operations  like  ‘test-and-set-bit’,  ‘test-and-clear-bit’,  ‘test-and-change-bit’  are  often  needed  in 
applications  that  use  bit-arrays.  This  is  why  modern  CPUs  often  have  instructions  implementing  these 
operations. 


The  class  does  not  supply  overloading  of  the  array-index  operator  [ ] because  the  writing  variant 
would  cause  a performance  penalty.  One  might  want  to  add  ‘sparse’-versions  of  the  scan  functions 
(next_set_idx()  and  next_clear_idx())  for  large  bit-arrays  with  only  few  bits  set  or  unset. 

On the AMD64 architecture the corresponding CPU instructions are used [FXT: bits/bitasm-amd64.h]:

static inline ulong asm_bts(ulong *f, ulong i)
// Bit Test and Set
{
    ulong ret;
    asm ( "btsq %2, %1 \n"
          "sbbq %0, %0"
          : "=r" (ret)
          : "m" (*f), "r" (i) );
    return ret;
}

If no specialized CPU instructions are available, the following two macros are used:

#define DIVMOD(n, d, bm) \
ulong d = n / BITS_PER_LONG; \
ulong bm = 1UL << (n % BITS_PER_LONG);

#define DIVMOD_TEST(n, d, bm) \
ulong d = n / BITS_PER_LONG; \
ulong bm = 1UL << (n % BITS_PER_LONG); \
ulong t = bm & f_[d];

The macro BITS_USE_ASM determines whether the CPU instruction is available:

    ulong test_set(ulong n)  // Test whether n-th bit is set and set it.
    {
#ifdef BITS_USE_ASM
        return asm_bts(f_, n);
#else
        DIVMOD_TEST(n, d, bm);
        f_[d] |= bm;
        return t;
#endif
    }

Performance is still good in that case as the modulo operation and division by BITS_PER_LONG (a power
of 2) are replaced with cheap (bit-and and shift) operations. On the machine described in appendix B
on page 922 both versions give practically identical performance.

The way out-of-bounds accesses are handled can be defined at the beginning of the header file:

#define  CHECK  0 //  define  to  disable  check  of  out  of  bounds  access 

//#define  CHECK  1 //  define  to  handle  out  of  bounds  access 
//#define  CHECK  2 //  define  to  fail  with  out  of  bounds  access 
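
The portable path (division and modulo by the word size, then a mask) can be sketched in a few lines. The class below is a hypothetical miniature, not the FXT bitarray: it has no bounds checks and no assembler path, and its scan function inspects one bit at a time instead of skipping whole words.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Hypothetical minimal bit-array (no bounds checks, no assembler path).
class tiny_bitarray
{
    static const ulong bits_per_long = sizeof(ulong) * CHAR_BIT;
    std::vector<ulong> f_;  // bit bucket, all bits initially clear
    ulong n_;               // number of bits

public:
    explicit tiny_bitarray(ulong nbits)
        : f_((nbits + bits_per_long - 1) / bits_per_long, 0UL), n_(nbits) {}

    // Test whether bit n is set.
    bool test(ulong n) const
    { return 0 != ( f_[n / bits_per_long] & (1UL << (n % bits_per_long)) ); }

    // Test whether bit n is set, then set it; return the old state.
    bool test_set(ulong n)
    {
        const ulong d = n / bits_per_long;             // word index
        const ulong bm = 1UL << (n % bits_per_long);   // mask within the word
        const bool t = (0 != (f_[d] & bm));
        f_[d] |= bm;
        return t;
    }

    // Index of the next set bit at or after n, or the number of bits if there is none.
    ulong next_set_idx(ulong n) const
    {
        while ( (n < n_) && !test(n) )  ++n;
        return n;
    }
};
```

A typical use is marking visited items: the first `test_set(i)` returns false and marks index i, every later call returns true.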


4.7  Left-right  array 


The left-right array (or LR-array) keeps track of a range of indices 0, ..., n-1. Every index can have
two states, free or set. The LR-array implements the following operations in time O(log n): marking the
k-th free index as set; marking the k-th set index as free; for the i-th (absolute) index, finding how many
indices of the same type (free or set) are left (or right) to it (including or excluding i).

The implementation is given as [FXT: class left_right_array in ds/left-right-array.h]:

class left_right_array
{
public:
    ulong *fl_;  // Free indices Left (including current element) in bsearch interval
    bool *tg_;   // tags: tg[i]==true if and only if index i is free
    ulong n_;    // total number of indices
    ulong f_;    // number of free indices

The arrays used have n elements:

public:
    left_right_array(ulong n)
    {
        n_ = n;
        fl_ = new ulong[n_];
        tg_ = new bool[n_];
        free_all();
    }

    ~left_right_array()
    {
        delete [] fl_;
        delete [] tg_;
    }

    ulong num_free() const { return f_; }
    ulong num_set() const { return n_ - f_; }


The initialization routine free_all() of the array fl[] uses a variation of the binary search algorithm
described in section 3.2:

private:
    void init_rec(ulong i0, ulong i1)
    // Set elements of fl[0,...,n-2] according to empty array a[].
    // The element fl[n-1] needs to be set to 1 afterwards.
    // Work is O(n).
    {
        if ( (i1-i0)!=0 )
        {
            ulong t = (i1+i0)/2;
            init_rec(i0, t);
            init_rec(t+1, i1);
        }
        fl_[i1] = i1-i0+1;
    }

public:
    void free_all()
    // Mark all indices as free.
    {
        f_ = n_;
        for (ulong j=0; j<n_; ++j) tg_[j] = true;
        init_rec(0, n_-1);
        fl_[n_-1] = 1;
    }


The crucial observation is that the set of all intervals occurring with binary search is fixed if the size of
the searched array is fixed. For any interval $[i_0, i_1]$ the element fl[t] where $t = \lfloor (i_0 + i_1)/2 \rfloor$ contains
the number of free positions in $[i_0, t]$. The following method returns the k-th free index:


    ulong get_free_idx(ulong k) const
    // Return the k-th ( 0 <= k < num_free() ) free index.
    // Return ~0UL if k is out of bounds.
    // Work is O(log(n)).
    {
        if ( k >= num_free() ) return ~0UL;

        ulong i0 = 0, i1 = n_-1;
        while ( 1 )
        {
            ulong t = (i1+i0)/2;
            if ( (fl_[t] == k+1) && (tg_[t]) ) return t;

            if ( fl_[t] > k )  // left:
            {
                i1 = t;
            }
            else  // right:
            {
                i0 = t+1;  k-=fl_[t];
            }
        }
    }


Usually  one  would  have  an  extra  array  where  one  actually  does  write  to  the  position  returned  above. 
Then  the  data  of  the  LR-array  has  to  be  modified  accordingly.  The  following  method  does  this: 

    ulong get_free_idx_chg(ulong k)
    // Return the k-th ( 0 <= k < num_free() ) free index.
    // Return ~0UL if k is out of bounds.
    // Change the arrays fl[] and tg[] reflecting
    // that index i will be set afterwards.
    // Work is O(log(n)).
    {
        if ( k >= num_free() ) return ~0UL;

        --f_;

        ulong i0 = 0, i1 = n_-1;
        while ( 1 )
        {
            ulong t = (i1+i0)/2;

            if ( (fl_[t] == k+1) && (tg_[t]) )
            {
                --fl_[t];
                tg_[t] = false;
                return t;
            }

            if ( fl_[t] > k )  // left:
            {
                --fl_[t];
                i1 = t;
            }
            else  // right:
            {
                i0 = t+1;  k-=fl_[t];
            }
        }
    }


[Figure 4.7-A, first part: rows fl[] and a[] for the initial state and the first five steps.]

(continued)

  last:
    fl[] = 0 0 0 1 2 1 1 0 0
    a[]  = 1 3 5 * * * 6 4 2

  first:
    fl[] = 0 0 0 0 1 1 1 0 0
    a[]  = 1 3 5 7 * * 6 4 2

  last:
    fl[] = 0 0 0 0 1 0 0 0 0
    a[]  = 1 3 5 7 * 8 6 4 2

  first:
    fl[] = 0 0 0 0 0 0 0 0 0
    a[]  = 1 3 5 7 9 8 6 4 2

Figure 4.7-A: Alternately setting the first and last free position in an LR-array. Asterisks denote free
positions, indices i where tg[i] is true.


For example, the following program alternately sets the first and last free position until no free position
is left [FXT: ds/left-right-array-demo.cc]:

    ulong n = 9;
    ulong *A = new ulong[n];
    left_right_array LR(n);
    LR.free_all();

    // PRINT

    for (ulong e=0; e<n; ++e)
    {
        ulong s = 0;  // first free
        if ( 0!=(e&1) ) s = LR.num_free()-1;  // last free

        ulong idx2 = LR.get_free_idx_chg(s);
        A[idx2] = e+1;

        // PRINT
    }

Its output is shown in figure 4.7-A. For large n the method get_free_idx_chg() runs at a rate of (very
roughly) 2 million per second. The method to free the k-th set position is


    ulong get_set_idx_chg(ulong k)
    // Return the k-th ( 0 <= k < num_set() ) set index.
    // Return ~0UL if k is out of bounds.
    // Change the arrays fl[] and tg[] reflecting
    // that index i will be freed afterwards.
    // Work is O(log(n)).
    {
        if ( k >= num_set() ) return ~0UL;

        ++f_;

        ulong i0 = 0, i1 = n_-1;
        while ( 1 )
        {
            ulong t = (i1+i0)/2;

            // how many elements to the left are set:
            ulong slt = t-i0+1 - fl_[t];

            if ( (slt == k+1) && (tg_[t]==false) )
            {
                ++fl_[t];
                tg_[t] = true;
                return t;
            }

            if ( slt > k )  // left:
            {
                ++fl_[t];
                i1 = t;
            }
            else  // right:
            {
                i0 = t+1;  k-=slt;
            }
        }
    }


The following method returns the number of free indices left of i (and excluding i):

    ulong num_FLE(ulong i) const
    // Return number of Free indices Left of (absolute) index i (Excluding i).
    // Work is O(log(n)).
    {
        if ( i >= n_ ) { return ~0UL; }  // out of bounds

        ulong i0 = 0, i1 = n_-1;
        ulong ns = i;  // number of set element left to i (including i)
        while ( 1 )
        {
            if ( i0==i1 ) break;

            ulong t = (i1+i0)/2;
            if ( i<=t )  // left:
            {
                i1 = t;
            }
            else  // right:
            {
                ns -= fl_[t];
                i0 = t+1;
            }
        }

        return i-ns;
    }


Based  on  it  are  methods  to  determine  the  number  of  free/set  indices  to  the  left/right,  including/excluding 
the  given  index.  We  omit  the  out-of-bounds  clauses  in  the  following: 

    ulong num_FLI(ulong i) const
    // Return number of Free indices Left of (absolute) index i (Including i).
    { return num_FLE(i) + tg_[i]; }

    ulong num_FRE(ulong i) const
    // Return number of Free indices Right of (absolute) index i (Excluding i).
    { return num_free() - num_FLI(i); }

    ulong num_FRI(ulong i) const
    // Return number of Free indices Right of (absolute) index i (Including i).
    { return num_free() - num_FLE(i); }

    ulong num_SLE(ulong i) const
    // Return number of Set indices Left of (absolute) index i (Excluding i).
    { return i - num_FLE(i); }

    ulong num_SLI(ulong i) const
    // Return number of Set indices Left of (absolute) index i (Including i).
    { return i - num_FLE(i) + !tg_[i]; }

    ulong num_SRE(ulong i) const
    // Return number of Set indices Right of (absolute) index i (Excluding i).
    { return num_set() - num_SLI(i); }

    ulong num_SRI(ulong i) const
    // Return number of Set indices Right of (absolute) index i (Including i).
    { return num_set() - i + num_FLE(i); }

These can be used for the fast conversion between permutations and inversion tables, see section 10.1.1.1
on page 235.
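
The core of the LR-array fits into a compact, self-contained sketch. The class below is a reduced illustration (hypothetical name `lr_sketch`, `std::vector` storage, only initialization and get_free_idx_chg()); the helper `fill_alternating()` repeats the experiment of the demo program.

```cpp
#include <vector>

typedef unsigned long ulong;

// Compact sketch of the LR-array: only initialization and get_free_idx_chg().
class lr_sketch
{
    std::vector<ulong> fl_;  // free indices left (including t) in each bsearch interval
    std::vector<bool> tg_;   // tg_[i] is true if and only if index i is free
    ulong f_;                // number of free indices

    void init_rec(ulong i0, ulong i1)
    {
        if ( i1 != i0 )
        {
            ulong t = (i1 + i0) / 2;
            init_rec(i0, t);
            init_rec(t + 1, i1);
        }
        fl_[i1] = i1 - i0 + 1;
    }

public:
    explicit lr_sketch(ulong n) : fl_(n), tg_(n, true), f_(n)
    {
        if ( n != 0 ) { init_rec(0, n - 1);  fl_[n - 1] = 1; }
    }

    ulong num_free() const { return f_; }

    // Return the k-th free index and mark it as set; ~0UL if k is out of bounds.
    ulong get_free_idx_chg(ulong k)
    {
        if ( k >= f_ )  return ~0UL;
        --f_;
        ulong i0 = 0, i1 = fl_.size() - 1;
        while ( true )
        {
            ulong t = (i1 + i0) / 2;
            if ( (fl_[t] == k + 1) && tg_[t] )
            {
                --fl_[t];
                tg_[t] = false;
                return t;
            }
            if ( fl_[t] > k ) { --fl_[t];  i1 = t; }         // continue left
            else              { i0 = t + 1;  k -= fl_[t]; }  // continue right
        }
    }
};

// Alternately set the first and the last free position, writing 1, 2, ..., n.
inline std::vector<ulong> fill_alternating(ulong n)
{
    std::vector<ulong> a(n, 0);
    lr_sketch lr(n);
    for (ulong e = 0; e < n; ++e)
    {
        ulong s = (e & 1) ? lr.num_free() - 1 : 0;  // odd steps: last free, even: first
        a[lr.get_free_idx_chg(s)] = e + 1;
    }
    return a;
}
```

For n = 9 the resulting array is 1 3 5 7 9 8 6 4 2, in agreement with the last row of figure 4.7-A.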
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Chapter  5 

Conventions  and  considerations 


We give algorithms for the generation of all combinatorial objects of certain types such as combinations,
compositions, subsets, permutations, integer partitions, set partitions, restricted growth strings and
necklaces. Finally, we give some constructions for Hadamard and conference matrices. Several (more esoteric)
combinatorial objects that are found via searching in directed graphs are presented in chapter 20.

These  routines  are  useful  in  situations  where  an  exhaustive  search  over  all  configurations  of  a certain  kind 
is  needed.  Combinatorial  algorithms  are  also  fundamental  to  many  programming  problems  and  they  can 
simply  be  fun! 

5.1  Representations  and  orders 

For a set of n elements we will take either {0, 1, ..., n-1} or {1, 2, ..., n}. Our convention for the set
notation is to start with the smallest element. Often there is more than one useful way to represent a
combinatorial object. For example the subset {1, 4, 6} of the set {0, 1, 2, 3, 4, 5, 6} can also be written
as a delta set [0100101]. Some sources use the term bit string. We often write dots instead of zeros
for readability: [.1..1.1]. Note that in the delta set we put the first element to the left side (array
notation), this is in contrast to the usual way of printing binary numbers, where the least significant bit
(bit number zero) is shown on the right side.

For most objects we will give an algorithm for generation in lexicographic (or simply lex) order. In
lexicographic order a string $X = [x_0, x_1, \ldots]$ precedes the string $Y = [y_0, y_1, \ldots]$ if for the smallest index
k where the strings differ we have $x_k < y_k$. Further, the string X precedes X.W (the concatenation of X
with W) for any nonempty string W. The co-lexicographic (or simply colex) order is obtained by sorting
with respect to the reversed strings. The order sometimes depends on the representation that is used,
for an example see figure 8.1-A on page 202.
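
Both orders can be expressed with the standard library, as in the following sketch. The type name `word` and the function names are assumptions for this example; `std::lexicographical_compare` treats a proper prefix as smaller, which matches the rule for X and X.W.

```cpp
#include <algorithm>
#include <vector>

typedef std::vector<unsigned> word;

// True if X precedes Y in lexicographic order (a proper prefix precedes its extensions).
inline bool lex_less(const word &X, const word &Y)
{
    return std::lexicographical_compare(X.begin(), X.end(), Y.begin(), Y.end());
}

// Co-lexicographic order: compare the reversed strings lexicographically.
inline bool colex_less(const word &X, const word &Y)
{
    return std::lexicographical_compare(X.rbegin(), X.rend(), Y.rbegin(), Y.rend());
}
```

For the 2-subsets of {0, 1, 2, 3} written as sorted lists, [0,3] precedes [1,2] in lex order, while in colex order [1,2] comes first because the reversed strings [2,1] and [3,0] are compared.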

In a minimal-change order the amount of change between successive objects is the least possible. Such
an order is also called a (combinatorial) Gray code. There is in general more than one such order. Often
we can impose even stricter conditions, like that (with permutations) the changes are between adjacent
positions. The corresponding order is a strong minimal-change order. A very readable survey of Gray
codes is given in [343], see also [298].

5.2  Ranking,  unranking,  and  counting 

For  a particular  ordering  of  combinatorial  objects  (say,  lexicographic  order  for  permutations)  we  can  ask 
which  position  in  the  list  a given  object  has.  An  algorithm  for  finding  the  position  is  called  a ranking 
algorithm.  A method  to  determine  the  object,  given  its  position,  is  called  an  unranking  algorithm. 

Given both ranking and unranking methods, one can compute the successor of a given object by computing
its rank r and unranking r + 1. While this method is usually slow, the idea can be used to find more
efficient algorithms for computing the successor. In addition the idea often suggests interesting orderings
for combinatorial objects.

We sometimes give ranking or unranking methods for numbers in special forms such as factorial
representations for permutations. Ranking and unranking methods are implicit in generation algorithms based
on mixed radix counting given in section 10.9 on page 258.


A simple but surprisingly powerful way to discover isomorphisms (one-to-one correspondences) between
combinatorial objects is counting them. If the sequences of numbers of two kinds of objects are identical,
chances are good of finding a conversion routine between the corresponding objects. For example, there
are $2^{n-1}$ permutations of n elements such that no element lies more than one position to the right of
its original position. With this observation an algorithm for generating these permutations via binary
counting can be found, see section 11.2 on page 282.


The representation of combinatorial objects as restricted growth strings (as shown in section 15.2 on
page 325) follows from the same idea. The resulting generation methods can be very fast and flexible.


The number of objects of a given size can often be given by an explicit expression (for example, the
number of parentheses strings of n pairs is the Catalan number $C_n = \binom{2n}{n}/(n+1)$; see section 15.4 on
page 331). The ordinary generating function (OGF) for a combinatorial object has a power series whose
coefficients count the objects: for the Catalan numbers we have the OGF

$$C(x) \;=\; \sum_{n=0}^{\infty} C_n\, x^n \;=\; \frac{1 - \sqrt{1 - 4x}}{2x} \qquad (5.2\text{-}1)$$
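
The agreement between the explicit expression and the generating function can be checked numerically. The closed form (5.2-1) satisfies $C(x) = 1 + x\,C(x)^2$ (multiply by 2x, isolate the square root, and square), which gives the recurrence $C_0 = 1$, $C_{m+1} = \sum_{k=0}^{m} C_k\,C_{m-k}$. The sketch below, with hypothetical function names, computes the numbers both ways; the closed-form routine is only meant for n up to about 30, where the intermediate products still fit into 64 bits.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned long long u64;

// Catalan number via the explicit expression C_n = binomial(2n, n) / (n+1).
inline u64 catalan_closed(unsigned n)
{
    u64 b = 1;  // after step k this is binomial(n+k, k)
    for (unsigned k = 1; k <= n; ++k)  b = b * (n + k) / k;  // division is exact
    return b / (n + 1);
}

// First 'num' Catalan numbers via C_0 = 1, C_{m+1} = sum_{k=0}^{m} C_k C_{m-k},
// the coefficient form of C(x) = 1 + x C(x)^2.
inline std::vector<u64> catalan_by_ogf(std::size_t num)
{
    std::vector<u64> c(num, 0);
    if ( num > 0 )  c[0] = 1;
    for (std::size_t m = 0; m + 1 < num; ++m)
        for (std::size_t k = 0; k <= m; ++k)  c[m + 1] += c[k] * c[m - k];
    return c;
}
```

Both routines give the sequence 1, 1, 2, 5, 14, 42, 132, and so on.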


Generating functions can often be given even though no explicit expression for the number of the objects
is known. The generating functions sometimes can be used to observe nontrivial identities, for example,
that the number of partitions into distinct parts equals the number of partitions into odd parts, given as
relation 16.4-23 on page 348. An exponential generating function (EGF) for a type of object where there
are $E_n$ objects of size n has the power series of the form (see, for example, relation 11.1-7 on page 279)

$$\sum_{n=0}^{\infty} E_n\, \frac{x^n}{n!} \qquad (5.2\text{-}2)$$


An excellent introduction to generating functions is given in [166]; for in-depth information see
[167, vol. 2, chap. 21, p. 1021], [143], and m-


5.3  Characteristics  of  the  algorithms 

In  almost  all  cases  we  produce  the  combinatorial  objects  one  by  one.  Let  n be  the  size  of  the  object.  The 
successor  (with  respect  to  the  specified  order)  is  computed  from  the  object  itself  and  additional  data  of 
a size  less  than  a constant  multiple  of  n. 

Let $B$ be the total number of combinatorial objects under consideration. Sometimes the cost of a successor
computation is $O(n)$. Then the total cost for generating all objects is $O(n \cdot B)$.

If the successor computation takes a fixed number of operations (independent of the object size), then
we say the algorithm is $O(1)$. If so, there can be no loop in the implementation; we say the algorithm is
loopless. Then the total cost for all objects is $c \cdot B$ for some constant $c$, independent of the object size.
A loopless algorithm can only exist if the amount of change between successive objects is bounded by a
constant that does not depend on the object size. Natural candidates for loopless algorithms are Gray
codes.

In many cases the cost of computing all objects is also $c \cdot B$ while the computation of the successor does
involve a loop. As an example consider incrementing in binary using arrays: in half of the cases just
the lowest bit changes, for half of the remaining cases just two bits change, and so on. The total cost
is $B \cdot (1 + \frac{1}{2}(1 + \frac{1}{2}(\cdots))) = 2 \cdot B$, independent of the number of bits used. So the total cost is as in
the loopless case while the successor computation can be expensive in some cases. Algorithms with this
characteristic are said to be constant amortized time (or CAT). Often CAT algorithms are faster than
loopless algorithms, typically if their structure is simpler.
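The binary-increment example can be made concrete by counting array writes. The sketch below is an illustration only (the names increment_bits and total_writes are not from FXT): it stores the bits in an array, lowest bit first, and sums the number of entries written over all increments.

```cpp
#include <cstddef>
#include <vector>

// Increment a binary number stored as an array of bits (lowest bit first).
// Returns the number of array entries that were written.
unsigned long increment_bits(std::vector<unsigned char> &bits)
{
    unsigned long writes = 0;
    std::size_t j = 0;
    // carry: trailing ones become zeros
    while ( j<bits.size() && bits[j]==1 )  { bits[j] = 0;  ++j;  ++writes; }
    if ( j<bits.size() )  { bits[j] = 1;  ++writes; }
    return writes;
}

// Total number of writes when stepping from 0 through all 2^n words of n bits.
unsigned long total_writes(unsigned n)
{
    std::vector<unsigned char> bits(n, 0);
    const unsigned long B = 1UL << n;
    unsigned long total = 0;
    for (unsigned long i=1; i<B; ++i)  total += increment_bits(bits);
    return total;
}
```

For n = 3 the seven increments write 1+2+1+3+1+2+1 = 11 entries, and in general the total stays below 2 · B although a single increment can touch all n bits.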
5.4 Optimization techniques


Let x be an array of n elements. The loop

  ulong k = 0;
  while ( (k<n) && (x[k]!=0) )  ++k;  // find first zero

can be replaced by

  ulong k = 0;
  while ( x[k]!=0 )  ++k;  // find first zero

if a single sentinel element x[n]=0 is appended to the end of the array. The latter version will often be
faster as fewer branches occur.
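A self-contained version of the two loops is sketched below, using std::vector so that the sentinel can be appended and removed again. The function names are made up for this illustration; in a real generator the sentinel would be allocated once and left in place.

```cpp
#include <cstddef>
#include <vector>

// Index of the first zero among x[0..n-1], or n if there is none.
// Two tests per step: array bound and value.
std::size_t first_zero_checked(const std::vector<unsigned long> &x)
{
    std::size_t k = 0;
    while ( (k<x.size()) && (x[k]!=0) )  ++k;
    return k;
}

// Same result with one test per step.  A zero is appended as sentinel, so
// the scan stops at index n at the latest.  The sentinel is removed again
// before returning.
std::size_t first_zero_sentinel(std::vector<unsigned long> &x)
{
    x.push_back(0);  // sentinel x[n] = 0
    std::size_t k = 0;
    while ( x[k]!=0 )  ++k;
    x.pop_back();
    return k;
}
```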

The test for equality as in

  ulong k = 0;
  while ( k!=n )  { /*...*/  ++k; }

is more expensive than the test for equality with zero as in

  ulong k = n;
  while ( --k!=0 )  { /*...*/ }

Therefore the latter version should be used when applicable.

To reduce the number of branches, replace the two tests

  if ( (x<0) || (x>m) )  { /*...*/ }

by the following single test where unsigned integers are used:

  if ( x>m )  { /*...*/ }
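The single test works because a negative value, converted to an unsigned type, becomes a very large number and therefore also compares greater than m. A minimal sketch, assuming m is nonnegative (the function names are illustrative only):

```cpp
// Two comparisons: is x outside the range 0..m ?  (m >= 0 is assumed)
inline bool out_of_range_two_tests(long x, long m)
{
    return (x<0) || (x>m);
}

// One comparison: after conversion to unsigned, a negative x wraps around
// to a value greater than any nonnegative m.
inline bool out_of_range_one_test(long x, long m)
{
    return static_cast<unsigned long>(x) > static_cast<unsigned long>(m);
}
```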


Use a do-while construct instead of a while-do loop whenever possible because the latter also tests the
loop condition at entry. Even if the do-while version causes some additional work, the gain from avoiding
a branch may outweigh it. Note that in the C language the for-loop also tests the condition at loop entry.


When computing the next object there may be special cases where the update is easy. If the percentage
of these 'easy cases' is not too small, an extra branch in the update routine should be created. The
performance gain is very visible in most cases (section 10.4 on page 245) and can be dramatic
(section 10.5 on page 248).


Recursive routines can be quite elegant and versatile, see, for example, section 6.4 on page 182 and
section 13.2.1 on page 297. However, expect only about half the speed of a good iterative implementation
of the same algorithm. The notation for list recursions is given in section 14.1 on page 304.

Address generation can be simpler if arrays are used instead of pointers. This technique is useful for
many permutation generators, see chapter 10 on page 232. Change the pointer declarations to array
declarations in the corresponding class as follows:


  //ulong *p_;  // permutation data (pointer version)
  ulong p_[32];  // permutation data (array version)


Here we assume that nobody would attempt to compute all permutations of 31 or more elements
($31! \approx 8.22 \cdot 10^{33}$, taking about $1.3 \cdot 10^{18}$ years to finish). To use arrays uncomment (in the
corresponding header files) a line like


#define  PERM_REV2_FIXARRAYS  //  use  arrays  instead  of  pointers  (speedup) 


This will also disable the statements to allocate and free memory with the pointers. Whether the use of
arrays tends to give a speedup is noted in the comment. The performance gain can be spectacular, see
section 7.1 on page 194.


5.5  Implementations,  demo-programs,  and  timings 

Most combinatorial generators are implemented as C++ classes. The first object in the given order is
created by the method first(). The method to compute the successor is usually next(). If a method
for the computation of the predecessor is given, then it is called prev() and a method last() to compute
the last element in the list is given.

The current combinatorial object can be accessed through the method data(). To make all data of a
class accessible the data is declared public. This way the need for various get_something() methods
is avoided. To minimize the danger of accidental modification of class data the variable names end with
an underscore. For example, the class for the generation of combinations in lexicographic order starts as

  class combination_lex
  {
  public:
      ulong *x_;  // combination: k elements 0<=x[j]<n in increasing order
      ulong n_, k_;  // Combination (n choose k)

The  methods  for  the  user  of  the  class  are  public,  the  internal  methods  (which  can  leave  the  data  in  an 
inconsistent  state)  are  declared  private. 

Timings for the routines are given with most demo-programs. For example, the timings for the generation
of subsets in minimal-change order (as delta sets, implemented in [FXT: class subset_gray_delta in
comb/subset-gray-delta.h]) are given near the end of [FXT: comb/subset-gray-delta-demo.cc], together
with the parameters used:

  Timing:
  time ./bin 30
  arg 1: 30 == n  [Size of the set]  default=5
  arg 2: 0 == cq  [Whether to start with full set]  default=0
  ./bin 30  5.90s user 0.02s system 100% cpu 5.912 total
  ==> 2^30/5.90 == 181,990,139 per second

  // with SUBSET_GRAY_DELTA_MAX_ARRAY_LEN defined:
  time ./bin 30
  arg 1: 30 == n  [Size of the set]  default=5
  arg 2: 0 == cq  [Whether to start with full set]  default=0
  ./bin 30  5.84s user 0.01s system 99% cpu 5.853 total
  ==> 2^30/5.84 == 183,859,901 per second

For  your  own  measurements  simply  uncomment  the  line 

//#define  TIMING  //  uncomment  to  disable  printing 

near  the  top  of  the  demo-program.  The  rate  of  generation  for  a certain  object  is  occasionally  given  as 
123  M/s,  meaning  that  123  million  objects  are  generated  per  second. 

If  a generator  routine  is  used  in  an  application,  one  must  do  the  benchmarking  with  the  application. 
Choosing  the  optimal  ordering  and  type  of  representation  (for  example,  delta  sets  versus  sets)  for  the 
given  task  is  crucial  for  good  performance.  Further  optimization  will  very  likely  involve  the  surrounding 
code  rather  than  the  generator  alone. 
Chapter 6

Combinations 


We give algorithms to generate all subsets of the $n$-element set that contain $k$ elements. For brevity we
sometimes refer to the $\binom{n}{k}$ combinations of $k$ out of $n$ elements as "the combinations $\binom{n}{k}$".

6.1 Binomial coefficients

Figure 6.1-A: The binomial coefficients $\binom{n}{k}$ for $0 \le n, k \le 15$.


The number of ways to choose $k$ elements from a set of $n$ elements equals the binomial coefficient
('$n$ choose $k$', or '$k$ out of $n$'):

  $$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{n\,(n-1)\,(n-2)\cdots(n-k+1)}{k\,(k-1)\,(k-2)\cdots 1} = \frac{\prod_{j=1}^{k}(n-j+1)}{k!} = \frac{n^{\underline{k}}}{k^{\underline{k}}} \qquad (6.1\text{-}1)$$

The last equality uses the falling factorial notation $a^{\underline{b}} := a\,(a-1)\,(a-2)\cdots(a-b+1)$. Equivalently, a
set of $n$ elements has $\binom{n}{k}$ subsets of exactly $k$ elements. These subsets are called the $k$-subsets
(where $k$ is fixed) or $k$-combinations of an $n$-set (a set with $n$ elements).

To avoid overflow during the computation of the binomial coefficient, use the form

  $$\binom{n}{k} = \frac{(n-k+1)^{\overline{k}}}{1^{\overline{k}}} = \frac{n-k+1}{1} \cdot \frac{n-k+2}{2} \cdot \frac{n-k+3}{3} \cdots \frac{n}{k} \qquad (6.1\text{-}2)$$

An implementation is given in [FXT: aux0/binomial.h]:

  inline ulong binomial(ulong n, ulong k)
  {
      if ( k>n )  return 0;
      if ( (k==0) || (k==n) )  return 1;
      if ( 2*k > n )  k = n-k;  // use symmetry

      ulong b = n - k + 1;
      ulong f = b;
      for (ulong j=2; j<=k; ++j)
      {
          ++f;
          b *= f;
          b /= j;
      }
      return b;
  }

The table of the first binomial coefficients is shown in figure 6.1-A. This table is called Pascal's triangle;
it was generated with the program [FXT: comb/binomial-demo.cc]. Observe that

  $$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k} \qquad (6.1\text{-}3)$$

That is, each entry is the sum of its upper and left upper neighbor. The generating function for the
$k$-combinations of an $n$-set is

  $$(1+x)^n = \sum_{k=0}^{n} \binom{n}{k}\, x^k \qquad (6.1\text{-}4)$$
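The additive relation 6.1-3 and the product form 6.1-2 can be checked against each other. The following sketch is an illustration and not taken from FXT (both function names are made up); it builds one row of Pascal's triangle by additions only and computes single entries with the product scheme.

```cpp
#include <vector>

// Row n of Pascal's triangle, built with relation 6.1-3 only (additions).
std::vector<unsigned long long> pascal_row(unsigned n)
{
    std::vector<unsigned long long> row(n+1, 0);
    row[0] = 1;
    for (unsigned m=1; m<=n; ++m)
        for (unsigned k=m; k>=1; --k)  row[k] += row[k-1];  // in place, right to left
    return row;
}

// Binomial coefficient by the product form 6.1-2.
// After step j the value is binomial(n-k+j, j), so every division is exact.
unsigned long long binomial_product(unsigned long long n, unsigned long long k)
{
    if ( k>n )  return 0;
    if ( 2*k > n )  k = n - k;  // use symmetry
    unsigned long long b = 1;
    for (unsigned long long j=1; j<=k; ++j)  b = b * (n - k + j) / j;
    return b;
}
```

Row 4 comes out as 1, 4, 6, 4, 1, and both routines agree on every entry of the first rows. The product form needs only k multiplications and divisions for a single entry, while the additive form is convenient when a whole row or table is wanted.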


6.2  Lexicographic  and  co-lexicographic  order 


          lexicographic                          co-lexicographic
        set         delta set          set         delta set   set reversed
   1:  { 0, 1, 2 }   111...       1:  { 0, 1, 2 }   111...     { 2, 1, 0 }
   2:  { 0, 1, 3 }   11.1..       2:  { 0, 1, 3 }   11.1..     { 3, 1, 0 }
   3:  { 0, 1, 4 }   11..1.       3:  { 0, 2, 3 }   1.11..     { 3, 2, 0 }
   4:  { 0, 1, 5 }   11...1       4:  { 1, 2, 3 }   .111..     { 3, 2, 1 }
   5:  { 0, 2, 3 }   1.11..       5:  { 0, 1, 4 }   11..1.     { 4, 1, 0 }
   6:  { 0, 2, 4 }   1.1.1.       6:  { 0, 2, 4 }   1.1.1.     { 4, 2, 0 }
   7:  { 0, 2, 5 }   1.1..1       7:  { 1, 2, 4 }   .11.1.     { 4, 2, 1 }
   8:  { 0, 3, 4 }   1..11.       8:  { 0, 3, 4 }   1..11.     { 4, 3, 0 }
   9:  { 0, 3, 5 }   1..1.1       9:  { 1, 3, 4 }   .1.11.     { 4, 3, 1 }
  10:  { 0, 4, 5 }   1...11      10:  { 2, 3, 4 }   ..111.     { 4, 3, 2 }
  11:  { 1, 2, 3 }   .111..      11:  { 0, 1, 5 }   11...1     { 5, 1, 0 }
  12:  { 1, 2, 4 }   .11.1.      12:  { 0, 2, 5 }   1.1..1     { 5, 2, 0 }
  13:  { 1, 2, 5 }   .11..1      13:  { 1, 2, 5 }   .11..1     { 5, 2, 1 }
  14:  { 1, 3, 4 }   .1.11.      14:  { 0, 3, 5 }   1..1.1     { 5, 3, 0 }
  15:  { 1, 3, 5 }   .1.1.1      15:  { 1, 3, 5 }   .1.1.1     { 5, 3, 1 }
  16:  { 1, 4, 5 }   .1..11      16:  { 2, 3, 5 }   ..11.1     { 5, 3, 2 }
  17:  { 2, 3, 4 }   ..111.      17:  { 0, 4, 5 }   1...11     { 5, 4, 0 }
  18:  { 2, 3, 5 }   ..11.1      18:  { 1, 4, 5 }   .1..11     { 5, 4, 1 }
  19:  { 2, 4, 5 }   ..1.11      19:  { 2, 4, 5 }   ..1.11     { 5, 4, 2 }
  20:  { 3, 4, 5 }   ...111      20:  { 3, 4, 5 }   ...111     { 5, 4, 3 }

Figure 6.2-A: All combinations $\binom{6}{3}$ in lexicographic order (left) and co-lexicographic order (right).


The combinations of three elements out of six in lexicographic (or simply lex) order are shown in
figure 6.2-A (left). The sequence is such that the sets are ordered lexicographically. Note that for the
delta sets the element zero is printed first whereas with binary words (section 1.24 on page 62) the least
significant bit (bit zero) is printed last. The sequence for co-lexicographic (or colex) order is such that
the sets, when written reversed, are ordered lexicographically.


6.2.1  Lexicographic  order 

The following implementation generates the combinations in lexicographic order as sets [FXT: class
combination_lex in comb/combination-lex.h]:

  class combination_lex
  {
  public:
      ulong *x_;  // combination: k elements 0<=x[j]<n in increasing order
      ulong n_, k_;  // Combination (n choose k)

  public:
      combination_lex(ulong n, ulong k)
      {
          n_ = n;  k_ = k;
          x_ = new ulong[k_];
          first();
      }

      ~combination_lex()  { delete [] x_; }

      void first()
      {
          for (ulong k=0; k<k_; ++k)  x_[k] = k;
      }

      void last()
      {
          for (ulong i=0; i<k_; ++i)  x_[i] = n_ - k_ + i;
      }

Computation  of  the  successor  and  predecessor: 


      ulong next()
      // Return smallest position that changed, return k with last combination
      {
          if ( x_[0] == n_ - k_ )  // current combination is the last
          { first();  return k_; }

          ulong j = k_ - 1;
          // easy case:  highest element != highest possible value:
          if ( x_[j] < (n_-1) )  { ++x_[j];  return j; }

          // find highest falling edge:
          while ( 1 == (x_[j] - x_[j-1]) )  { --j; }

          // move lowest element of highest block up:
          ulong ret = j - 1;
          ulong z = ++x_[j-1];

          // ... and attach rest of block:
          while ( j < k_ )  { x_[j] = ++z;  ++j; }

          return ret;
      }

      ulong prev()
      // Return smallest position that changed, return k with first combination
      {
          if ( x_[k_-1] == k_-1 )  // current combination is the first
          { last();  return k_; }

          // find highest falling edge:
          ulong j = k_ - 1;
          while ( 1 == (x_[j] - x_[j-1]) )  { --j; }
          ulong ret = j;

          --x_[j];  // move down edge element

          // ... and move rest of block to high end:
          while ( ++j < k_ )  x_[j] = n_ - k_ + j;

          return ret;
      }


The listing in figure 6.2-A was created with the program [FXT: comb/combination-lex-demo.cc]. The
routine generates the combinations $\binom{32}{20}$ at a rate of about 104 million per second. The
combinations $\binom{32}{12}$ are generated at a rate of 166 million per second.
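For experimenting with the lexicographic order outside of FXT, a compact successor routine on a std::vector can be used. The sketch below is a simplified variant, not the routine from combination_lex: it has no easy-case branch and it reports the last combination by returning false instead of wrapping around. The function names are chosen for this illustration, and k <= n is assumed.

```cpp
#include <cstddef>
#include <vector>

// Step x[0] < x[1] < ... < x[k-1] (elements from 0..n-1) to its
// lexicographic successor.  Returns false and leaves x unchanged if x is
// the last combination { n-k, ..., n-1 }.
bool lex_next(std::vector<unsigned> &x, unsigned n)
{
    const std::size_t k = x.size();
    std::size_t j = k;
    // skip elements that already have their highest possible value:
    while ( j>0 && x[j-1] == n - k + (j-1) )  --j;
    if ( j==0 )  return false;
    --j;
    ++x[j];  // move the rightmost movable element up
    for (std::size_t i=j+1; i<k; ++i)  x[i] = x[i-1] + 1;  // attach rest of block
    return true;
}

// Number of combinations visited when starting at { 0, ..., k-1 }.
unsigned long count_combinations_lex(unsigned n, unsigned k)
{
    std::vector<unsigned> x(k);
    for (unsigned j=0; j<k; ++j)  x[j] = j;
    unsigned long ct = 1;
    while ( lex_next(x, n) )  ++ct;
    return ct;
}
```

Starting from { 0, 1, 2 } with n = 6 the routine steps through { 0, 1, 3 }, { 0, 1, 4 }, { 0, 1, 5 }, { 0, 2, 3 }, and so on, as in the left part of figure 6.2-A, and visits 20 combinations in total.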


6.2.2  Co-lexicographic  order 

The combinations of three elements out of six in co-lexicographic (or colex) order are shown in
figure 6.2-A (right). Algorithms to compute the successor and predecessor are implemented in [FXT: class
combination_colex in comb/combination-colex.h]:
  class combination_colex
  {
  public:
      ulong *x_;  // combination: k elements 0<=x[j]<n in increasing order
      ulong n_, k_;  // Combination (n choose k)

      combination_colex(ulong n, ulong k)
      {
          n_ = n;  k_ = k;
          x_ = new ulong[k_+1];
          x_[k_] = n_ + 2;  // sentinel
          first();
      }
    [--snip--]

      ulong next()
      // Return greatest position that changed, return k with last combination
      {
          if ( x_[0] == n_ - k_ )  // current combination is the last
          { first();  return k_; }

          ulong j = 0;
          // until lowest rising edge:  attach block at low end
          while ( 1 == (x_[j+1] - x_[j]) )  { x_[j] = j;  ++j; }  // can touch sentinel

          ++x_[j];  // move edge element up

          return j;
      }

      ulong prev()
      // Return greatest position that changed, return k with first combination
      {
          if ( x_[k_-1] == k_-1 )  // current combination is the first
          { last();  return k_; }

          // find lowest falling edge:
          ulong j = 0;
          while ( j == x_[j] )  ++j;  // can touch sentinel

          --x_[j];  // move edge element down
          ulong ret = j;

          // attach rest of low block:
          while ( 0 != j-- )  x_[j] = x_[j+1] - 1;

          return ret;
      }
    [--snip--]


The listing in figure 6.2-A was created with the program [FXT: comb/combination-colex-demo.cc]. The
combinations $\binom{32}{20}$ are generated at a rate of about 140 million objects per second, the
combinations $\binom{32}{12}$ are generated at a rate of 190 million objects per second.


As a toy application of the combinations in co-lexicographic order we compute the products of k of
the n smallest primes. We maintain an array of k products shown at the right of figure 6.2-B. If the
return value of the method next() is j, then j + 1 elements have to be updated from right to left [FXT:
comb/kproducts-colex-demo.cc]:


  combination_colex C(n, k);
  const ulong *c = C.data();  // combinations as sets

  ulong *tf = new ulong[n];  // table of Factors (primes)
  // fill in small primes:
  for (ulong j=0,f=2; j<n; ++j)  { tf[j] = f;  f=next_small_prime(f+1); }

  ulong *tp = new ulong[k+1];  // table of Products
  tp[k] = 1;  // one appended (sentinel)

  ulong j = k-1;
  do
  {
      // update products from right:
      ulong x = tp[j+1];
      { ulong i = j;
          do
          {
              ulong f = tf[ c[i] ];
              x *= f;
              tp[i] = x;
          }
          while ( i-- );
      }  // here: final product is x == tp[0]

      // visit the product x here

      j = C.next();
  }
  while ( j < k );

        combination     j   delta-set    products
   1:   { 0, 1, 2 }     2   111....      [   30    15    5   1 ]
   2:   { 0, 1, 3 }     2   11.1...      [   42    21    7   1 ]
   3:   { 0, 2, 3 }     1   1.11...      [   70    35    7   1 ]
   4:   { 1, 2, 3 }     0   .111...      [  105    35    7   1 ]
   5:   { 0, 1, 4 }     2   11..1..      [   66    33   11   1 ]
   6:   { 0, 2, 4 }     1   1.1.1..      [  110    55   11   1 ]
   7:   { 1, 2, 4 }     0   .11.1..      [  165    55   11   1 ]
   8:   { 0, 3, 4 }     1   1..11..      [  154    77   11   1 ]
   9:   { 1, 3, 4 }     0   .1.11..      [  231    77   11   1 ]
  10:   { 2, 3, 4 }     0   ..111..      [  385    77   11   1 ]
  11:   { 0, 1, 5 }     2   11...1.      [   78    39   13   1 ]
  12:   { 0, 2, 5 }     1   1.1..1.      [  130    65   13   1 ]
  13:   { 1, 2, 5 }     0   .11..1.      [  195    65   13   1 ]
  14:   { 0, 3, 5 }     1   1..1.1.      [  182    91   13   1 ]
  15:   { 1, 3, 5 }     0   .1.1.1.      [  273    91   13   1 ]
  16:   { 2, 3, 5 }     0   ..11.1.      [  455    91   13   1 ]
  17:   { 0, 4, 5 }     1   1...11.      [  286   143   13   1 ]
  18:   { 1, 4, 5 }     0   .1..11.      [  429   143   13   1 ]
  19:   { 2, 4, 5 }     0   ..1.11.      [  715   143   13   1 ]
  20:   { 3, 4, 5 }     0   ...111.      [ 1001   143   13   1 ]
  21:   { 0, 1, 6 }     2   11....1      [  102    51   17   1 ]
  22:   { 0, 2, 6 }     1   1.1...1      [  170    85   17   1 ]
  23:   { 1, 2, 6 }     0   .11...1      [  255    85   17   1 ]
  24:   { 0, 3, 6 }     1   1..1..1      [  238   119   17   1 ]
  25:   { 1, 3, 6 }     0   .1.1..1      [  357   119   17   1 ]
  26:   { 2, 3, 6 }     0   ..11..1      [  595   119   17   1 ]
  27:   { 0, 4, 6 }     1   1...1.1      [  374   187   17   1 ]
  28:   { 1, 4, 6 }     0   .1..1.1      [  561   187   17   1 ]
  29:   { 2, 4, 6 }     0   ..1.1.1      [  935   187   17   1 ]
  30:   { 3, 4, 6 }     0   ...11.1      [ 1309   187   17   1 ]
  31:   { 0, 5, 6 }     1   1....11      [  442   221   17   1 ]
  32:   { 1, 5, 6 }     0   .1...11      [  663   221   17   1 ]
  33:   { 2, 5, 6 }     0   ..1..11      [ 1105   221   17   1 ]
  34:   { 3, 5, 6 }     0   ...1.11      [ 1547   221   17   1 ]
  35:   { 4, 5, 6 }     0   ....111      [ 2431   221   17   1 ]

Figure 6.2-B: All products of k = 3 of the n = 7 smallest primes (2, 3, 5, ..., 17). The products are the
leftmost elements of the array on the right side.

The  leftmost  element  of  this  array  is  the  desired  product.  A sentinel  element  at  the  end  of  the  array  is 
used  to  avoid  an  extra  branch  with  the  loop  variable.  With  lexicographic  order  the  update  would  go  from 
left  to  right. 
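The product computation can also be tried without the FXT classes. The following is a self-contained sketch under stated assumptions: the colex successor is re-implemented on a std::vector (with the same sentinel idea, but without wrapping around to the first combination), the primes come from a small fixed table so that 1 <= k <= n <= 10 is required, and all names are chosen for this illustration.

```cpp
#include <vector>

// Colex successor with sentinel: x has k+1 entries, x[k] is a sentinel > n.
// Returns the greatest position that changed, or k if x is the last combination.
unsigned colex_next(std::vector<unsigned> &x, unsigned n)
{
    const unsigned k = static_cast<unsigned>(x.size()) - 1;
    if ( x[0] == n - k )  return k;  // current combination is the last
    unsigned j = 0;
    // until lowest rising edge: attach block at low end
    while ( 1 == (x[j+1] - x[j]) )  { x[j] = j;  ++j; }  // can touch sentinel
    ++x[j];  // move edge element up
    return j;
}

// All products of k of the n smallest primes, in colex order of the index sets.
// Assumes 1 <= k <= n <= 10 (size of the prime table below).
std::vector<unsigned long> kproducts_colex(unsigned n, unsigned k)
{
    static const unsigned long primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };
    std::vector<unsigned> x(k+1);
    for (unsigned j=0; j<k; ++j)  x[j] = j;
    x[k] = n + 2;  // sentinel
    std::vector<unsigned long> tp(k+1, 1);  // partial products; tp[k] == 1 is a sentinel
    std::vector<unsigned long> out;
    unsigned j = k - 1;
    do
    {
        // update products from right: only positions j, j-1, ..., 0 changed
        unsigned long p = tp[j+1];
        for (unsigned i=j+1; i-- > 0; )  { p *= primes[ x[i] ];  tp[i] = p; }
        out.push_back(p);  // p == tp[0]
        j = colex_next(x, n);
    }
    while ( j < k );
    return out;
}
```

For n = 7 and k = 3 the list starts with 30, 42, 70 and ends with 2431, matching the leftmost entries of the product arrays in figure 6.2-B.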


6.3  Order  by  prefix  shifts  (cool-lex) 


An implementation is given in [FXT: class combination_pref in comb/combination-pref.h].
        k=1            k=2            k=3            k=4
   1:  1....      1:  11...      1:  111..      1:  1111.
   2:  .1...      2:  .11..      2:  .111.      2:  .1111
   3:  ..1..      3:  1.1..      3:  1.11.      3:  1.111
   4:  ...1.      4:  .1.1.      4:  11.1.      4:  11.11
   5:  ....1      5:  ..11.      5:  .11.1      5:  111.1
                  6:  1..1.      6:  1.1.1
                  7:  .1..1      7:  .1.11
                  8:  ..1.1      8:  ..111
                  9:  ...11      9:  1..11
                 10:  1...1     10:  11..1

Figure 6.3-A: Combinations $\binom{5}{k}$, for k = 1, 2, 3, 4 in an ordering generated by prefix shifts.

Figure 6.3-B: Combinations $\binom{9}{3}$ via prefix shifts.

  class combination_pref
  {
  public:
      ulong *b_;  // data as delta set
      ulong s_, t_, n_;  // combination (n choose k) where n=s+t, k=t.
  private:
      ulong x, y;  // aux

  public:
      combination_pref(ulong n, ulong k)
      // Must have: n>=2, k>=1  (i.e. s!=0 and t!=0)
      {
          s_ = n - k;
          t_ = k;
          n_ = s_ + t_;
          b_ = new ulong[n_];
          first();
      }
    [--snip--]

      void first()
      {
          for (ulong j=0; j<n_; ++j)  b_[j] = 0;
          for (ulong j=0; j<t_; ++j)  b_[j] = 1;
          x = 0;  y = 0;
      }

      bool next()
      {
          if ( x==0 )  { x=1;  b_[t_]=1;  b_[0]=0;  return true; }
          else
          {
              if ( x>=n_-1 )  return false;
              else
              {
                  b_[x] = 0;  ++x;  b_[y] = 1;  ++y;  // X(s,t)
                  if ( b_[x]==0 )
                  {
                      b_[x] = 1;  b_[0] = 0;  // Y(s,t)
                      if ( y>1 )  x = 1;  // Z(s,t)
                      y = 0;
                  }
                  return true;
              }
          }
      }
    [--snip--]


The combinations $\binom{32}{20}$ and $\binom{32}{12}$ are generated at a rate of about 200 M/s.
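To watch the prefix-shift update at work, the rule from next() can be transcribed into a stand-alone function that collects all delta sets as strings. This is a sketch for experimentation only: the function name is made up, '1' and '.' stand for present and absent elements, and n >= 2 with 1 <= k < n is assumed, as in the class.

```cpp
#include <string>
#include <vector>

// All combinations (n choose k) as delta sets ('1' = element present,
// '.' = absent) in the order produced by the prefix-shift update rule.
// Must have n>=2 and 1<=k<n.
std::vector<std::string> combinations_prefix_shift(unsigned n, unsigned k)
{
    std::string b(n, '.');
    for (unsigned j=0; j<k; ++j)  b[j] = '1';
    std::vector<std::string> out(1, b);  // first combination
    unsigned x = 0, y = 0;
    while ( true )
    {
        if ( x==0 )  { x = 1;  b[k] = '1';  b[0] = '.'; }
        else
        {
            if ( x >= n-1 )  break;  // the last combination has been visited
            b[x] = '.';  ++x;  b[y] = '1';  ++y;
            if ( b[x]=='.' )
            {
                b[x] = '1';  b[0] = '.';
                if ( y>1 )  x = 1;
                y = 0;
            }
        }
        out.push_back(b);
    }
    return out;
}
```

For n = 5 and k = 2 the function returns the ten words 11..., .11.., 1.1.., .1.1., ..11., 1..1., .1..1, ..1.1, ...11, 1...1, in agreement with the second column of figure 6.3-A.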
6.4 Minimal-change order


            Gray code                 complemented Gray code
   1:  { 0, 1, 2 }   111...       1:  { 3, 4, 5 }   ...111
   2:  { 0, 2, 3 }   1.11..       2:  { 1, 4, 5 }   .1..11
   3:  { 1, 2, 3 }   .111..       3:  { 0, 4, 5 }   1...11
   4:  { 0, 1, 3 }   11.1..       4:  { 2, 4, 5 }   ..1.11
   5:  { 0, 3, 4 }   1..11.       5:  { 1, 2, 5 }   .11..1
   6:  { 1, 3, 4 }   .1.11.       6:  { 0, 2, 5 }   1.1..1
   7:  { 2, 3, 4 }   ..111.       7:  { 0, 1, 5 }   11...1
   8:  { 0, 2, 4 }   1.1.1.       8:  { 1, 3, 5 }   .1.1.1
   9:  { 1, 2, 4 }   .11.1.       9:  { 0, 3, 5 }   1..1.1
  10:  { 0, 1, 4 }   11..1.      10:  { 2, 3, 5 }   ..11.1
  11:  { 0, 4, 5 }   1...11      11:  { 1, 2, 3 }   .111..
  12:  { 1, 4, 5 }   .1..11      12:  { 0, 2, 3 }   1.11..
  13:  { 2, 4, 5 }   ..1.11      13:  { 0, 1, 3 }   11.1..
  14:  { 3, 4, 5 }   ...111      14:  { 0, 1, 2 }   111...
  15:  { 0, 3, 5 }   1..1.1      15:  { 1, 2, 4 }   .11.1.
  16:  { 1, 3, 5 }   .1.1.1      16:  { 0, 2, 4 }   1.1.1.
  17:  { 2, 3, 5 }   ..11.1      17:  { 0, 1, 4 }   11..1.
  18:  { 0, 2, 5 }   1.1..1      18:  { 1, 3, 4 }   .1.11.
  19:  { 1, 2, 5 }   .11..1      19:  { 0, 3, 4 }   1..11.
  20:  { 0, 1, 5 }   11...1      20:  { 2, 3, 4 }   ..111.

Figure 6.4-A: Combinations $\binom{6}{3}$ in Gray order (left) and complemented Gray order (right).


The combinations of three elements out of six in a minimal-change order (a Gray code) are shown in
figure 6.4-A (left). With each transition exactly one element changes its position. We use a recursion for
the list $C(n,k)$ of combinations (notation as in relation 14.1-1 on page 304):

  $$C(n,k) = \begin{bmatrix} C(n-1,k) \\ (n)\,.\,C^{R}(n-1,k-1) \end{bmatrix} = \begin{bmatrix} 0\,.\,C(n-1,k) \\ 1\,.\,C^{R}(n-1,k-1) \end{bmatrix} \qquad (6.4\text{-}1)$$


The first equality is for the set representation, the second for the delta-set representation. An
implementation is given in [FXT: comb/combination-gray-rec-demo.cc]:

  ulong *x;  // elements in combination at x[1] ... x[k]

  void comb_gray(ulong n, ulong k, bool z)
  {
      if ( k==n )
      {
          for (ulong j=1; j<=k; ++j)  x[j] = j;
          visit();
          return;
      }

      if ( z )  // forward:
      {
          comb_gray(n-1, k, z);
          if ( k>0 )  { x[k] = n;  comb_gray(n-1, k-1, !z); }
      }
      else  // backward:
      {
          if ( k>0 )  { x[k] = n;  comb_gray(n-1, k-1, !z); }
          comb_gray(n-1, k, z);
      }
  }
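The recursion can be run stand-alone by collecting the visited sets in a list instead of calling visit(). The sketch below is an adaptation for experimentation and not the FXT demo: the global array is replaced by a vector that is passed along, elements are numbered 1..n as in the recursive routine, and a helper (hypothetical name one_element_replaced) checks the minimal-change property between neighbors.

```cpp
#include <vector>

typedef std::vector<unsigned> combo;

// Recursion 6.4-1 in set representation: elements 1..n, x[1..K] holds the
// combination under construction (x[0] is unused, K is the top-level k).
static void gray_rec(unsigned n, unsigned k, bool z, combo &x, std::vector<combo> &out)
{
    if ( k==n )
    {
        for (unsigned j=1; j<=k; ++j)  x[j] = j;
        out.push_back( combo(x.begin()+1, x.end()) );  // visit
        return;
    }
    if ( z )  // forward:
    {
        gray_rec(n-1, k, z, x, out);
        if ( k>0 )  { x[k] = n;  gray_rec(n-1, k-1, !z, x, out); }
    }
    else  // backward:
    {
        if ( k>0 )  { x[k] = n;  gray_rec(n-1, k-1, !z, x, out); }
        gray_rec(n-1, k, z, x, out);
    }
}

// All k-subsets of {1,...,n} in the minimal-change order of the recursion.
std::vector<combo> combinations_gray(unsigned n, unsigned k)
{
    combo x(k+1, 0);
    std::vector<combo> out;
    gray_rec(n, k, true, x, out);
    return out;
}

// True if b arises from a by replacing exactly one element.
bool one_element_replaced(const combo &a, const combo &b)
{
    unsigned common = 0;
    for (unsigned u : a)  for (unsigned v : b)  common += (u==v);
    return common + 1 == a.size();
}
```

With n = 6 and k = 3 the list starts { 1, 2, 3 }, { 1, 3, 4 }, { 2, 3, 4 }, { 1, 2, 4 }, which is the left column of figure 6.4-A with all elements increased by one.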

The recursion can be partly unfolded as follows

  $$C(n,k) = \begin{bmatrix} C(n-2,k) \\ (n-1)\,.\,C^{R}(n-2,k-1) \\ (n)\,.\,C^{R}(n-1,k-1) \end{bmatrix} = \begin{bmatrix} 0\,0\,.\,C(n-2,k) \\ 0\,1\,.\,C^{R}(n-2,k-1) \\ 1\,.\,C^{R}(n-1,k-1) \end{bmatrix} \qquad (6.4\text{-}2)$$

A recursion for the complemented order is

  $$C'(n,k) = \begin{bmatrix} (n)\,.\,C'(n-1,k-1) \\ C'^{R}(n-1,k) \end{bmatrix} = \begin{bmatrix} 1\,.\,C'(n-1,k-1) \\ 0\,.\,C'^{R}(n-1,k) \end{bmatrix} \qquad (6.4\text{-}3)$$

  void comb_gray_compl(ulong n, ulong k, bool z)
  {
    [--snip--]
      if ( z )  // forward:
      {
          if ( k>0 )  { x[k] = n;  comb_gray_compl(n-1, k-1, z); }
          comb_gray_compl(n-1, k, !z);
      }
      else  // backward:
      {
          comb_gray_compl(n-1, k, !z);
          if ( k>0 )  { x[k] = n;  comb_gray_compl(n-1, k-1, z); }
      }
  }

A very efficient (revolving door) algorithm to generate the sets for the Gray code is given in
[269]. An implementation following [215, alg.R, sect.7.2.1.3] is [FXT: class combination_revdoor in
comb/combination-revdoor.h]. Usage of the class is shown in [FXT: comb/combination-revdoor-demo.cc].
The routine generates the combinations $\binom{32}{20}$ at a rate of about 115 M/s, the combinations
$\binom{32}{12}$ are generated at a rate of 181 M/s. An implementation geared for good performance for small
values of k is given in [223], a C++ adaptation is [FXT: comb/combination-lam-demo.cc]. The combinations
$\binom{32}{12}$ are generated at a rate of 190 M/s and the combinations $\binom{64}{7}$ at a rate of 250 M/s. The
routine is limited to values $k \ge 2$.


6.5  The  Eades-McKay  strong  minimal-change  order 


In  any  Gray  code  order  for  combinations  just  one  element  is  moved  between  successive  combinations. 
When  an  element  is  moved  across  any  other,  there  is  more  than  one  change  on  the  set  representation.  If 
i elements  are  crossed,  then  i + 1 entries  in  the  set  change: 


     set              delta set
  { 0, 1, 2, 3 }      1111..
  { 1, 2, 3, 4 }      .1111.


A strong minimal-change order is a Gray code where only one entry in the set representation is changed
per step. That is, only zeros in the delta set representation are crossed; the moves are called homogeneous.
One such order is the Eades-McKay sequence described in [134]. The Eades-McKay sequence for the
combinations $\binom{7}{3}$ is shown in figure 6.5-A (left).
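Whether a given transition is homogeneous can be tested directly on the delta sets. The following helper is a sketch written for this purpose (the name is_homogeneous_move is an assumption, not an FXT routine); delta sets are passed as strings of '1' and '.'.

```cpp
#include <cstddef>
#include <string>

// True if delta set b arises from delta set a by moving a single 1 across
// zeros only, so that exactly one entry of the set representation changes.
bool is_homogeneous_move(const std::string &a, const std::string &b)
{
    if ( a.size() != b.size() )  return false;
    std::size_t lo = 0, hi = 0, diff = 0;
    for (std::size_t j=0; j<a.size(); ++j)
    {
        if ( a[j] != b[j] )
        {
            if ( diff==0 )  lo = j;  // first differing position
            hi = j;                  // last differing position
            ++diff;
        }
    }
    if ( diff != 2 )  return false;       // not a move of a single element
    if ( a[lo] == a[hi] )  return false;  // two elements added or two removed
    for (std::size_t j=lo+1; j<hi; ++j)
        if ( a[j]=='1' )  return false;   // another element was crossed
    return true;
}
```

The transition 1111.. to .1111. from the small table above is rejected because three elements are crossed, while a step such as 1..1..1 to ..11..1 is accepted.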


6.5.1  Recursive  generation 

The Eades-McKay order can be generated with the program [FXT: comb/combination-emk-rec-demo.cc]:

    ulong *rv;  // elements in combination at rv[1] ... rv[k]

    void
    comb_emk(ulong n, ulong k, bool z)
    {
        if ( k==n )
        {
            for (ulong j=1; j<=k; ++j)  rv[j] = j;
            visit();
            return;
        }

        if ( z )  // forward:
        {
            if ( (n>=2) && (k>=2) )  { rv[k] = n;  rv[k-1] = n-1;  comb_emk(n-2, k-2, z); }
            if ( (n>=2) && (k>=1) )  { rv[k] = n;  comb_emk(n-2, k-1, !z); }
            if ( (n>=1) )  { comb_emk(n-1, k, z); }
        }
        else  // backward:
        {
            if ( (n>=1) )  { comb_emk(n-1, k, z); }
            if ( (n>=2) && (k>=1) )  { rv[k] = n;  comb_emk(n-2, k-1, !z); }
            if ( (n>=2) && (k>=2) )  { rv[k] = n;  rv[k-1] = n-1;  comb_emk(n-2, k-2, z); }
        }
    }

           Eades-McKay                  complemented Eades-McKay
     1:  { 4, 5, 6 }  ....111        1:  { 4, 5, 6 }  ....111
     2:  { 3, 5, 6 }  ...1.11        2:  { 3, 5, 6 }  ...1.11
     3:  { 2, 5, 6 }  ..1..11        3:  { 2, 5, 6 }  ..1..11
     4:  { 1, 5, 6 }  .1...11        4:  { 1, 5, 6 }  .1...11
     5:  { 0, 5, 6 }  1....11        5:  { 0, 5, 6 }  1....11
     6:  { 0, 1, 6 }  11....1        6:  { 0, 4, 6 }  1...1.1
     7:  { 0, 2, 6 }  1.1...1        7:  { 1, 4, 6 }  .1..1.1
     8:  { 1, 2, 6 }  .11...1        8:  { 2, 4, 6 }  ..1.1.1
     9:  { 1, 3, 6 }  .1.1..1        9:  { 3, 4, 6 }  ...11.1
    10:  { 0, 3, 6 }  1..1..1       10:  { 2, 3, 6 }  ..11..1
    11:  { 2, 3, 6 }  ..11..1       11:  { 1, 3, 6 }  .1.1..1
    12:  { 2, 4, 6 }  ..1.1.1       12:  { 0, 3, 6 }  1..1..1
    13:  { 1, 4, 6 }  .1..1.1       13:  { 0, 2, 6 }  1.1...1
    14:  { 0, 4, 6 }  1...1.1       14:  { 1, 2, 6 }  .11...1
    15:  { 3, 4, 6 }  ...11.1       15:  { 0, 1, 6 }  11....1
    16:  { 3, 4, 5 }  ...111.       16:  { 0, 1, 5 }  11...1.
    17:  { 2, 4, 5 }  ..1.11.       17:  { 0, 2, 5 }  1.1..1.
    18:  { 1, 4, 5 }  .1..11.       18:  { 1, 2, 5 }  .11..1.
    19:  { 0, 4, 5 }  1...11.       19:  { 2, 3, 5 }  ..11.1.
    20:  { 0, 1, 5 }  11...1.       20:  { 1, 3, 5 }  .1.1.1.
    21:  { 0, 2, 5 }  1.1..1.       21:  { 0, 3, 5 }  1..1.1.
    22:  { 1, 2, 5 }  .11..1.       22:  { 0, 4, 5 }  1...11.
    23:  { 1, 3, 5 }  .1.1.1.       23:  { 1, 4, 5 }  .1..11.
    24:  { 0, 3, 5 }  1..1.1.       24:  { 2, 4, 5 }  ..1.11.
    25:  { 2, 3, 5 }  ..11.1.       25:  { 3, 4, 5 }  ...111.
    26:  { 2, 3, 4 }  ..111..       26:  { 2, 3, 4 }  ..111..
    27:  { 1, 3, 4 }  .1.11..       27:  { 1, 3, 4 }  .1.11..
    28:  { 0, 3, 4 }  1..11..       28:  { 0, 3, 4 }  1..11..
    29:  { 0, 1, 4 }  11..1..       29:  { 0, 2, 4 }  1.1.1..
    30:  { 0, 2, 4 }  1.1.1..       30:  { 1, 2, 4 }  .11.1..
    31:  { 1, 2, 4 }  .11.1..       31:  { 0, 1, 4 }  11..1..
    32:  { 1, 2, 3 }  .111...       32:  { 0, 1, 3 }  11.1...
    33:  { 0, 2, 3 }  1.11...       33:  { 0, 2, 3 }  1.11...
    34:  { 0, 1, 3 }  11.1...       34:  { 1, 2, 3 }  .111...
    35:  { 0, 1, 2 }  111....       35:  { 0, 1, 2 }  111....

Figure 6.5-A: Combinations $\binom{7}{3}$ in Eades-McKay order (left) and complemented Eades-McKay order (right).


The combinations $\binom{32}{20}$ are generated at a rate of about 44 million per second, the combinations $\binom{32}{12}$ at
a rate of 34 million per second.
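The recursion can be tried out in isolation with the following minimal sketch. It is not the FXT program: the names `emk_rec`, `emk_order` and `homogeneous_move` are hypothetical, the combinations are collected in a vector instead of being passed to `visit()`, and the elements are converted to the 0-based values used in figure 6.5-A.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> Comb;  // elements in increasing order, 0-based

// Same call structure as comb_emk(); rv[1..K] holds the 1-based elements.
static void emk_rec(ulong n, ulong k, bool z, Comb &rv, std::vector<Comb> &out)
{
    if ( k==n )
    {
        for (ulong j=1; j<=k; ++j)  rv[j] = j;
        Comb c;
        for (ulong j=1; j<rv.size(); ++j)  c.push_back( rv[j]-1 );  // to 0-based
        out.push_back(c);
        return;
    }

    if ( z )  // forward:
    {
        if ( (n>=2) && (k>=2) )  { rv[k] = n;  rv[k-1] = n-1;  emk_rec(n-2, k-2, z, rv, out); }
        if ( (n>=2) && (k>=1) )  { rv[k] = n;  emk_rec(n-2, k-1, !z, rv, out); }
        if ( (n>=1) )  { emk_rec(n-1, k, z, rv, out); }
    }
    else  // backward:
    {
        if ( (n>=1) )  { emk_rec(n-1, k, z, rv, out); }
        if ( (n>=2) && (k>=1) )  { rv[k] = n;  emk_rec(n-2, k-1, !z, rv, out); }
        if ( (n>=2) && (k>=2) )  { rv[k] = n;  rv[k-1] = n-1;  emk_rec(n-2, k-2, z, rv, out); }
    }
}

// All k-subsets of {0,...,n-1} in the order produced by the recursion.
std::vector<Comb> emk_order(ulong n, ulong k)
{
    assert( k<=n );
    Comb rv(k+1, 0);
    std::vector<Comb> out;
    emk_rec(n, k, true, rv, out);
    return out;
}

// True if b arises from a by moving one element across zeros only.
bool homogeneous_move(const Comb &a, const Comb &b, ulong n)
{
    std::vector<int> da(n, 0), db(n, 0);  // delta sets
    for (ulong x : a)  da[x] = 1;
    for (ulong x : b)  db[x] = 1;
    std::vector<ulong> diff;  // positions where the delta sets differ
    for (ulong i=0; i<n; ++i)  if ( da[i]!=db[i] )  diff.push_back(i);
    if ( diff.size()!=2 )  return false;
    for (ulong i=diff[0]+1; i<diff[1]; ++i)  if ( da[i] )  return false;
    return true;
}
```

Calling `emk_order(7, 3)` should give the left column of figure 6.5-A, and `homogeneous_move()` can be applied to successive entries to check the defining property of the ordering.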


The underlying recursion for the list E(n,k) of combinations $\binom{n}{k}$ is (notation as in relation 14.1-1 on
page 304)

$$
E(n,k) \;=\;
\begin{array}{l} [(n)\,.\,(n-1)\,.\,E(n-2,\,k-2)] \\ [(n)\,.\,E^{R}(n-2,\,k-1)] \\ [E(n-1,\,k)] \end{array}
\;=\;
\begin{array}{l} [1\,1\,.\,E(n-2,\,k-2)] \\ [1\,0\,.\,E^{R}(n-2,\,k-1)] \\ [0\,.\,E(n-1,\,k)] \end{array}
\qquad (6.5\text{-}1)
$$


Again, the first equality is for the set representation, the second for the delta-set representation. Counting
the elements on both sides gives the relation

$$
\binom{n}{k} \;=\; \binom{n-2}{k-2} + \binom{n-2}{k-1} + \binom{n-1}{k}
\qquad (6.5\text{-}2)
$$

which is an easy consequence of relation 6.1-3 on page 177.
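The step can be written out as follows, under the assumption that relation 6.1-3 is the usual addition rule for binomial coefficients (the relation itself is not reproduced in this section). The rule is applied once to $\binom{n}{k}$ and once more to the term $\binom{n-1}{k-1}$:

```latex
\begin{align*}
\binom{n}{k} &= \binom{n-1}{k-1} + \binom{n-1}{k} \\
             &= \left[ \binom{n-2}{k-2} + \binom{n-2}{k-1} \right] + \binom{n-1}{k}
\end{align*}
```

The three terms correspond, in this order, to the sublists with prefixes 1 1, 1 0, and 0 in the delta-set form of the recursion.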


A recursion for the complemented sequence (with respect to the delta sets) is

$$
E'(n,k) \;=\;
\begin{array}{l} [(n)\,.\,E'(n-1,\,k-1)] \\ [(n-1)\,.\,E'^{R}(n-2,\,k-1)] \\ [E'(n-2,\,k)] \end{array}
\;=\;
\begin{array}{l} [1\,.\,E'(n-1,\,k-1)] \\ [0\,1\,.\,E'^{R}(n-2,\,k-1)] \\ [0\,0\,.\,E'(n-2,\,k)] \end{array}
\qquad (6.5\text{-}3)
$$

Counting on both sides gives

$$
\binom{n}{k} \;=\; \binom{n-1}{k-1} + \binom{n-2}{k-1} + \binom{n-2}{k}
\qquad (6.5\text{-}4)
$$

The condition for the recursion end has to be modified:

    void
    comb_emk_compl(ulong n, ulong k, bool z)
    {
        if ( (k==0) || (k==n) )
        {
            for (ulong j=1; j<=k; ++j)  rv[j] = j;
            ++ct;
            visit();
            return;
        }

        if ( z )  // forward:
        {
            if ( (n>=1) && (k>=1) )  { rv[k] = n;  comb_emk_compl(n-1, k-1, z); }  // 1
            if ( (n>=2) && (k>=1) )  { rv[k] = n-1;  comb_emk_compl(n-2, k-1, !z); }  // 01
            if ( (n>=2) )  { comb_emk_compl(n-2, k-0, z); }  // 00
        }
        else  // backward:
        {
            if ( (n>=2) )  { comb_emk_compl(n-2, k-0, z); }  // 00
            if ( (n>=2) && (k>=1) )  { rv[k] = n-1;  comb_emk_compl(n-2, k-1, !z); }  // 01
            if ( (n>=1) && (k>=1) )  { rv[k] = n;  comb_emk_compl(n-1, k-1, z); }  // 1
        }
    }

The  complemented  sequence  is  not  a strong  Gray  code. 


6.5.2  Iterative  generation  via  modulo  moves 

An iterative algorithm for the Eades-McKay sequence is given in [FXT: class combination_emk in
comb/combination-emk.h]:

    class combination_emk
    {
    public:
        ulong *x_;  // combination: k elements 0<=x[j]<n in increasing order
        ulong *s_;  // aux: start of range for moves
        ulong *a_;  // aux: actual start position of moves
        ulong n_, k_;  // Combination (n choose k)

    public:
        combination_emk(ulong n, ulong k)
        {
            n_ = n;
            k_ = k;
            x_ = new ulong[k_+1];  // incl. high sentinel
            s_ = new ulong[k_+1];  // incl. high sentinel
            a_ = new ulong[k_];
            x_[k_] = n_;
            first();
        }
      [--snip--]

        void first()
        {
            for (ulong j=0; j<k_; ++j)  x_[j] = j;
            for (ulong j=0; j<k_; ++j)  s_[j] = j;
            for (ulong j=0; j<k_; ++j)  a_[j] = x_[j];
        }

The computation of the successor uses modulo steps:

        ulong next()
        // Return position where track changed, return k with last combination
        {
            ulong j = k_;
            while ( j-- )  // loop over tracks
            {
                const ulong sj = s_[j];
                const ulong m = x_[j+1] - sj - 1;

                if ( 0!=m )  // unless range empty
                {
                    ulong u = x_[j] - sj;

                    // modulo moves:
                    if ( 0==(j&1) )
                    {
                        ++u;
                        if ( u>m )  u = 0;
                    }
                    else
                    {
                        --u;
                        if ( u>m )  u = m;
                    }

                    u += sj;

                    if ( u != a_[j] )  // next position != start position
                    {
                        x_[j] = u;
                        s_[j+1] = u+1;
                        return j;
                    }
                }

                a_[j] = x_[j];
            }

            return k_;  // current combination is last
        }
    };

The combinations $\binom{32}{20}$ are generated at a rate of about 60 million per second, the combinations $\binom{32}{12}$ at
a rate of 85 million per second [FXT: comb/combination-emk-demo.cc].
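The modulo-move successor can be exercised with the following self-contained sketch. It follows the listing for `combination_emk` but is not the FXT class: the names `emk_gen` and `emk_all` are hypothetical, and `std::vector` replaces the raw arrays.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

struct emk_gen
{
    ulong n_, k_;
    std::vector<ulong> x_;  // combination, x_[k_] is a high sentinel
    std::vector<ulong> s_;  // start of range for moves
    std::vector<ulong> a_;  // actual start position of moves

    emk_gen(ulong n, ulong k) : n_(n), k_(k), x_(k+1), s_(k+1), a_(k)
    {
        assert( (k>=1) && (k<=n) );
        for (ulong j=0; j<k_; ++j)  { x_[j] = j;  s_[j] = j;  a_[j] = j; }
        x_[k_] = n_;
    }

    ulong next()  // return k_ if the current combination is the last
    {
        ulong j = k_;
        while ( j-- )  // loop over tracks
        {
            const ulong sj = s_[j];
            const ulong m = x_[j+1] - sj - 1;  // largest offset in the range
            if ( 0!=m )  // unless range empty
            {
                ulong u = x_[j] - sj;
                if ( 0==(j&1) )  { ++u;  if ( u>m )  u = 0; }
                else             { --u;  if ( u>m )  u = m; }  // --u wraps around for u==0
                u += sj;
                if ( u != a_[j] )  // next position != start position
                {
                    x_[j] = u;
                    s_[j+1] = u+1;
                    return j;
                }
            }
            a_[j] = x_[j];
        }
        return k_;
    }
};

// All combinations, in the order generated.
std::vector< std::vector<ulong> > emk_all(ulong n, ulong k)
{
    emk_gen g(n, k);
    std::vector< std::vector<ulong> > out;
    do  { out.push_back( std::vector<ulong>(g.x_.begin(), g.x_.begin()+k) ); }
    while ( g.next()!=k );
    return out;
}
```

The decrement in the odd branch relies on unsigned wrap-around: for `u==0` the value becomes huge, so the test `u>m` catches it and sets `u` to the upper end of the range.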

6.5.3  Alternative  order  via  modulo  moves 

A slight modification of the successor computation gives an ordering where the first and last combination
differ by a single transposition (though not a homogeneous one), see figure 6.5-B. The generator is given
in [FXT: class combination_mod in comb/combination-mod.h]:

    class combination_mod
    {
      [--snip--]
        ulong next()
        {
      [--snip--]
                // modulo moves:
                //      if ( 0==(j&1) )  // gives EMK
                if ( 0!=(j&1) )  // mod
      [--snip--]

The rate of generation is identical with the EMK order [FXT: comb/combination-mod-demo.cc].

           mod       EMK               mod       EMK
     1:  111....   111....       1:  1111...   1111...
     2:  11....1   11.1...       2:  111.1..   111...1
     3:  11...1.   11..1..       3:  111..1.   111..1.
     4:  11..1..   11...1.       4:  111...1   111.1..
     5:  11.1...   11....1       5:  11...11   11.11..
     6:  1.11...   1....11       6:  11..1.1   11.1..1
     7:  1.1...1   1...1.1       7:  11..11.   11.1.1.
     8:  1.1..1.   1...11.       8:  11.1.1.   11..11.
     9:  1.1.1..   1..1.1.       9:  11.1..1   11..1.1
    10:  1..11..   1..1..1      10:  11.11..   11...11
    11:  1..1..1   1..11..      11:  1.111..   1...111
    12:  1..1.1.   1.1.1..      12:  1.11.1.   1..1.11
    13:  1...11.   1.1..1.      13:  1.11..1   1..11.1
    14:  1...1.1   1.1...1      14:  1.1..11   1..111.
    15:  1....11   1.11...      15:  1.1.1.1   1.1.11.
    16:  ....111   .111...      16:  1.1.11.   1.1.1.1
    17:  ...1.11   .11.1..      17:  1..111.   1.1..11
    18:  ...11.1   .11..1.      18:  1..11.1   1.11..1
    19:  ...111.   .11...1      19:  1..1.11   1.11.1.
    20:  ..1.11.   .1...11      20:  1...111   1.111..
    21:  ..1.1.1   .1..1.1      21:  ...1111   .1111..
    22:  ..1..11   .1..11.      22:  ..1.111   .111..1
    23:  ..11..1   .1.1.1.      23:  ..11.11   .111.1.
    24:  ..11.1.   .1.1..1      24:  ..111.1   .11.11.
    25:  ..111..   .1.11..      25:  ..1111.   .11.1.1
    26:  .1.11..   ..111..      26:  .1.111.   .11..11
    27:  .1.1..1   ..11.1.      27:  .1.11.1   .1..111
    28:  .1.1.1.   ..11..1      28:  .1.1.11   .1.1.11
    29:  .1..11.   ..1..11      29:  .1..111   .1.11.1
    30:  .1..1.1   ..1.1.1      30:  .11..11   .1.111.
    31:  .1...11   ..1.11.      31:  .11.1.1   ..1111.
    32:  .11...1   ...111.      32:  .11.11.   ..111.1
    33:  .11..1.   ...11.1      33:  .111.1.   ..11.11
    34:  .11.1..   ...1.11      34:  .111..1   ..1.111
    35:  .111...   ....111      35:  .1111..   ...1111

Figure 6.5-B: All combinations $\binom{7}{3}$ (left) and $\binom{7}{4}$ (right) in mod order and EMK order.

6.6 Two-close orderings via endo/enup moves


6.6.1 The endo and enup orderings for numbers

The endo order of the set {0, 1, 2, ..., m} is obtained by writing all odd numbers of the set in increasing
order followed by all even numbers in decreasing order: {1, 3, 5, ..., 6, 4, 2, 0}. The term endo stands
for 'Even Numbers DOwn, odd numbers up'.

     m   endo sequence            m   enup sequence
     1   1 0                      1   0 1
     2   1 2 0                    2   0 2 1
     3   1 3 2 0                  3   0 2 3 1
     4   1 3 4 2 0                4   0 2 4 3 1
     5   1 3 5 4 2 0              5   0 2 4 5 3 1
     6   1 3 5 6 4 2 0            6   0 2 4 6 5 3 1
     7   1 3 5 7 6 4 2 0          7   0 2 4 6 7 5 3 1
     8   1 3 5 7 8 6 4 2 0        8   0 2 4 6 8 7 5 3 1
     9   1 3 5 7 9 8 6 4 2 0      9   0 2 4 6 8 9 7 5 3 1

Figure 6.6-A: The endo (left) and enup (right) orderings with maximal value m.

A routine for generating the successor in endo order with
maximal value m is [FXT: comb/endo-enup.h]:

    inline ulong next_endo(ulong x, ulong m)
    // Return next number in endo order
    {
        if ( x & 1 )  // x odd
        {
            x += 2;
            if ( x>m )  x = m - (m&1);  // == max even <= m
        }
        else  // x even
        {
            x = ( x==0 ? 1 : x-2 );
        }
        return x;
    }

The sequences for the first few m are shown in figure 6.6-A. For the input zero (the last number in endo order) the routine returns one, the first number.

An ordering starting with the even numbers in increasing order will be called enup (for 'Even Numbers
UP, odd numbers down'). The computation of the successor can be implemented as

    static inline ulong next_enup(ulong x, ulong m)
    {
        if ( x & 1 )  // x odd
        {
            x = ( x==1 ? 0 : x-2 );
        }
        else  // x even
        {
            x += 2;
            if ( x>m )  x = m - !(m&1);  // max odd <=m
        }
        return x;
    }
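The two successor routines can be checked against figure 6.6-A with a small sketch like the following. The routines are copied from the listings; the helpers `endo_sequence` and `enup_sequence` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

inline ulong next_endo(ulong x, ulong m)
{
    if ( x & 1 )  { x += 2;  if ( x>m )  x = m - (m&1); }  // odd: up, then max even <= m
    else          { x = ( x==0 ? 1 : x-2 ); }              // even: down, wrap to 1 after 0
    return x;
}

inline ulong next_enup(ulong x, ulong m)
{
    if ( x & 1 )  { x = ( x==1 ? 0 : x-2 ); }               // odd: down, wrap to 0 after 1
    else          { x += 2;  if ( x>m )  x = m - !(m&1); }  // even: up, then max odd <= m
    return x;
}

// The numbers 0..m in endo order (first number is 1).
std::vector<ulong> endo_sequence(ulong m)
{
    assert( m>=1 );
    std::vector<ulong> s;
    ulong x = 1;
    for (ulong i=0; i<=m; ++i)  { s.push_back(x);  x = next_endo(x, m); }
    return s;
}

// The numbers 0..m in enup order (first number is 0).
std::vector<ulong> enup_sequence(ulong m)
{
    assert( m>=1 );
    std::vector<ulong> s;
    ulong x = 0;
    for (ulong i=0; i<=m; ++i)  { s.push_back(x);  x = next_enup(x, m); }
    return s;
}
```

For example, `endo_sequence(5)` gives 1 3 5 4 2 0 and `enup_sequence(6)` gives 0 2 4 6 5 3 1, the rows for m = 5 and m = 6 of the figure.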

The orderings are reversals of each other, so we define:

    static inline ulong prev_endo(ulong x, ulong m)  { return next_enup(x, m); }
    static inline ulong prev_enup(ulong x, ulong m)  { return next_endo(x, m); }

A function that returns the x-th number in enup order with maximal digit m is

    static inline ulong enup_num(ulong x, ulong m)
    {
        ulong r = 2*x;
        if ( r>m )  r = 2*m+1 - r;
        return r;
    }

The function will only work if x <= m. For example, with m = 5:

    x:  0 1 2 3 4 5
    r:  0 2 4 5 3 1

The inverse function is

    static inline ulong enup_idx(ulong x, ulong m)
    {
        const ulong b = x & 1;
        x >>= 1;
        return ( b ? m-x : x );
    }

The function to map into endo order is

    static inline ulong endo_num(ulong x, ulong m)
    {
        // return enup_num(m-x, m);
        x = m - x;
        ulong r = 2*x;
        if ( r>m )  r = 2*m+1 - r;
        return r;
    }

For example,

    x:  0 1 2 3 4 5
    r:  1 3 5 4 2 0

Its inverse is

    static inline ulong endo_idx(ulong x, ulong m)
    {
        const ulong b = x & 1;
        x >>= 1;
        return ( b ? x : m-x );
    }
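That the index functions invert the corresponding number functions can be checked exhaustively for small m. The sketch below repeats the four routines so that it is self-contained; the checking function `enup_endo_roundtrip_ok` is a hypothetical helper.

```cpp
#include <cassert>

typedef unsigned long ulong;

static inline ulong enup_num(ulong x, ulong m)
{
    ulong r = 2*x;
    if ( r>m )  r = 2*m+1 - r;
    return r;
}

static inline ulong enup_idx(ulong x, ulong m)
{
    const ulong b = x & 1;
    x >>= 1;
    return ( b ? m-x : x );
}

static inline ulong endo_num(ulong x, ulong m)
{
    return enup_num(m-x, m);
}

static inline ulong endo_idx(ulong x, ulong m)
{
    const ulong b = x & 1;
    x >>= 1;
    return ( b ? x : m-x );
}

// True if idx(num(x)) == x for all 0 <= x <= m, for both orderings.
bool enup_endo_roundtrip_ok(ulong m)
{
    for (ulong x=0; x<=m; ++x)
    {
        if ( enup_idx( enup_num(x, m), m ) != x )  return false;
        if ( endo_idx( endo_num(x, m), m ) != x )  return false;
    }
    return true;
}
```

The reason it works: an even value 2x sits at index x from the enup start, while an odd value 2m+1-2x shifts right to m-x, which `enup_idx` turns back into x; the endo case is the mirror image.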


6.6.2  The  endo  and  enup  orderings  for  combinations 


Two strong minimal-change orderings for combinations can be obtained via moves in enup and endo
order. Figure 6.6-B shows an ordering where the moves to the right are on even positions (enup order,
left). If the moves to the right are on odd positions (endo order), then Chase's sequence is obtained
(right). Both have the property of being two-close: an element in the delta set moves by at most two
positions (and the move is homogeneous, no other element is crossed).

            enup moves                       endo moves
     1:  { 0, 1, 2 }  111.....        1:  { 0, 1, 2 }  111.....
     2:  { 0, 1, 4 }  11..1...        2:  { 0, 1, 3 }  11.1....
     3:  { 0, 1, 6 }  11....1.        3:  { 0, 1, 5 }  11...1..
     4:  { 0, 1, 7 }  11.....1        4:  { 0, 1, 7 }  11.....1
     5:  { 0, 1, 5 }  11...1..        5:  { 0, 1, 6 }  11....1.
     6:  { 0, 1, 3 }  11.1....        6:  { 0, 1, 4 }  11..1...
     7:  { 0, 2, 3 }  1.11....        7:  { 0, 3, 4 }  1..11...
     8:  { 0, 2, 4 }  1.1.1...        8:  { 0, 3, 5 }  1..1.1..
     9:  { 0, 2, 6 }  1.1...1.        9:  { 0, 3, 7 }  1..1...1
    10:  { 0, 2, 7 }  1.1....1       10:  { 0, 3, 6 }  1..1..1.
    11:  { 0, 2, 5 }  1.1..1..       11:  { 0, 5, 6 }  1....11.
    12:  { 0, 4, 5 }  1...11..       12:  { 0, 5, 7 }  1....1.1
    13:  { 0, 4, 6 }  1...1.1.       13:  { 0, 6, 7 }  1.....11
    14:  { 0, 4, 7 }  1...1..1       14:  { 0, 4, 7 }  1...1..1
    15:  { 0, 6, 7 }  1.....11       15:  { 0, 4, 6 }  1...1.1.
    16:  { 0, 5, 7 }  1....1.1       16:  { 0, 4, 5 }  1...11..
    17:  { 0, 5, 6 }  1....11.       17:  { 0, 2, 5 }  1.1..1..
    18:  { 0, 3, 6 }  1..1..1.       18:  { 0, 2, 7 }  1.1....1
    19:  { 0, 3, 7 }  1..1...1       19:  { 0, 2, 6 }  1.1...1.
    20:  { 0, 3, 5 }  1..1.1..       20:  { 0, 2, 4 }  1.1.1...
    21:  { 0, 3, 4 }  1..11...       21:  { 0, 2, 3 }  1.11....
    22:  { 2, 3, 4 }  ..111...       22:  { 1, 2, 3 }  .111....
    23:  { 2, 3, 6 }  ..11..1.       23:  { 1, 2, 5 }  .11..1..
    24:  { 2, 3, 7 }  ..11...1       24:  { 1, 2, 7 }  .11....1
    25:  { 2, 3, 5 }  ..11.1..       25:  { 1, 2, 6 }  .11...1.
    26:  { 2, 4, 5 }  ..1.11..       26:  { 1, 2, 4 }  .11.1...
    27:  { 2, 4, 6 }  ..1.1.1.       27:  { 1, 3, 4 }  .1.11...
    28:  { 2, 4, 7 }  ..1.1..1       28:  { 1, 3, 5 }  .1.1.1..
    29:  { 2, 6, 7 }  ..1...11       29:  { 1, 3, 7 }  .1.1...1
    30:  { 2, 5, 7 }  ..1..1.1       30:  { 1, 3, 6 }  .1.1..1.
    31:  { 2, 5, 6 }  ..1..11.       31:  { 1, 5, 6 }  .1...11.
    32:  { 4, 5, 6 }  ....111.       32:  { 1, 5, 7 }  .1...1.1
    33:  { 4, 5, 7 }  ....11.1       33:  { 1, 6, 7 }  .1....11
    34:  { 4, 6, 7 }  ....1.11       34:  { 1, 4, 7 }  .1..1..1
    35:  { 5, 6, 7 }  .....111       35:  { 1, 4, 6 }  .1..1.1.
    36:  { 3, 6, 7 }  ...1..11       36:  { 1, 4, 5 }  .1..11..
    37:  { 3, 5, 7 }  ...1.1.1       37:  { 3, 4, 5 }  ...111..
    38:  { 3, 5, 6 }  ...1.11.       38:  { 3, 4, 7 }  ...11..1
    39:  { 3, 4, 6 }  ...11.1.       39:  { 3, 4, 6 }  ...11.1.
    40:  { 3, 4, 7 }  ...11..1       40:  { 3, 5, 6 }  ...1.11.
    41:  { 3, 4, 5 }  ...111..       41:  { 3, 5, 7 }  ...1.1.1
    42:  { 1, 4, 5 }  .1..11..       42:  { 3, 6, 7 }  ...1..11
    43:  { 1, 4, 6 }  .1..1.1.       43:  { 5, 6, 7 }  .....111
    44:  { 1, 4, 7 }  .1..1..1       44:  { 4, 6, 7 }  ....1.11
    45:  { 1, 6, 7 }  .1....11       45:  { 4, 5, 7 }  ....11.1
    46:  { 1, 5, 7 }  .1...1.1       46:  { 4, 5, 6 }  ....111.
    47:  { 1, 5, 6 }  .1...11.       47:  { 2, 5, 6 }  ..1..11.
    48:  { 1, 3, 6 }  .1.1..1.       48:  { 2, 5, 7 }  ..1..1.1
    49:  { 1, 3, 7 }  .1.1...1       49:  { 2, 6, 7 }  ..1...11
    50:  { 1, 3, 5 }  .1.1.1..       50:  { 2, 4, 7 }  ..1.1..1
    51:  { 1, 3, 4 }  .1.11...       51:  { 2, 4, 6 }  ..1.1.1.
    52:  { 1, 2, 4 }  .11.1...       52:  { 2, 4, 5 }  ..1.11..
    53:  { 1, 2, 6 }  .11...1.       53:  { 2, 3, 5 }  ..11.1..
    54:  { 1, 2, 7 }  .11....1       54:  { 2, 3, 7 }  ..11...1
    55:  { 1, 2, 5 }  .11..1..       55:  { 2, 3, 6 }  ..11..1.
    56:  { 1, 2, 3 }  .111....       56:  { 2, 3, 4 }  ..111...

Figure 6.6-B: Combinations $\binom{8}{3}$ via enup moves (left) and via endo moves (Chase's sequence, right).

An implementation of an iterative algorithm for the computation of the combinations in enup order is
[FXT: class combination_enup in comb/combination-enup.h]:

    class combination_enup
    {
    public:
        ulong *x_;  // combination: k elements 0<=x[j]<n in increasing order
        ulong *s_;  // aux: start of range for enup moves
        ulong *a_;  // aux: actual start position of enup moves
        ulong n_, k_;  // Combination (n choose k)

    public:
        combination_enup(ulong n, ulong k)
        {
            n_ = n;
            k_ = k;
            x_ = new ulong[k_+1];  // incl. padding x_[k]
            s_ = new ulong[k_+1];  // incl. padding s_[k]
            a_ = new ulong[k_];
            x_[k_] = n_;
            first();
        }
      [--snip--]

        void first()
        {
            for (ulong j=0; j<k_; ++j)  x_[j] = j;
            for (ulong j=0; j<k_; ++j)  s_[j] = j;
            for (ulong j=0; j<k_; ++j)  a_[j] = x_[j];
        }


The 'padding' elements x[k] and s[k] allow omitting a branch, similar to sentinel elements. The successor
of the current combination is computed by finding the range of possible movements (variable m) and, unless
the range is empty, moving until we are back at the start position:

        ulong next()
        // Return position where track changed, return k with last combination
        {
            ulong j = k_;
            while ( j-- )  // loop over tracks
            {
                const ulong sj = s_[j];
                const ulong m = x_[j+1] - sj - 1;

                if ( 0!=m )  // unless range empty
                {
                    ulong u = x_[j] - sj;

                    // move right on even positions:
                    if ( 0==(sj&1) )  u = next_enup(u, m);
                    else              u = next_endo(u, m);

                    u += sj;

                    if ( u != a_[j] )  // next pos != start position
                    {
                        x_[j] = u;
                        s_[j+1] = u+1;
                        return j;
                    }
                }

                a_[j] = x_[j];
            }

            return k_;  // current combination is last
        }
    };


The combinations $\binom{32}{20}$ are generated at a rate of 45 million objects per second, the combinations $\binom{32}{12}$ at
a rate of 55 million per second. The only change in the implementation for computing the endo ordering
is (at the obvious place in the code) [FXT: comb/combination-endo.h]:

                    // move right on odd positions:
                    if ( 0==(sj&1) )  u = next_endo(u, m);
                    else              u = next_enup(u, m);
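Both variants can be tried with one self-contained sketch. It follows the successor computation of the listings but is not the FXT code: the function `two_close_all` and its `endo` flag are hypothetical, vectors replace the raw arrays, and the loop over tracks is inlined.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

static inline ulong next_endo(ulong x, ulong m)
{
    if ( x & 1 )  { x += 2;  if ( x>m )  x = m - (m&1); }
    else          { x = ( x==0 ? 1 : x-2 ); }
    return x;
}

static inline ulong next_enup(ulong x, ulong m)
{
    if ( x & 1 )  { x = ( x==1 ? 0 : x-2 ); }
    else          { x += 2;  if ( x>m )  x = m - !(m&1); }
    return x;
}

// All k-subsets of {0,...,n-1} via enup moves (endo==false)
// or via endo moves (endo==true, Chase's sequence).
std::vector< std::vector<ulong> > two_close_all(ulong n, ulong k, bool endo)
{
    assert( (k>=1) && (k<=n) );
    std::vector<ulong> x(k+1), s(k+1), a(k);
    for (ulong j=0; j<k; ++j)  { x[j] = j;  s[j] = j;  a[j] = j; }
    x[k] = n;  // padding element

    std::vector< std::vector<ulong> > out;
    while ( true )
    {
        out.push_back( std::vector<ulong>(x.begin(), x.begin()+k) );

        bool moved = false;
        ulong j = k;
        while ( !moved && j-- )  // loop over tracks
        {
            const ulong sj = s[j];
            const ulong m = x[j+1] - sj - 1;
            if ( 0!=m )  // unless range empty
            {
                ulong u = x[j] - sj;
                const bool even = ( 0==(sj&1) );
                // enup ordering: enup moves on even start positions, endo moves otherwise;
                // the endo ordering swaps the two.
                u = ( even!=endo ? next_enup(u, m) : next_endo(u, m) );
                u += sj;
                if ( u != a[j] )  { x[j] = u;  s[j+1] = u+1;  moved = true; }
            }
            if ( !moved )  a[j] = x[j];
        }
        if ( !moved )  break;  // current combination is last
    }
    return out;
}
```

With n = 8 and k = 3 the two settings of the flag should reproduce the left and right columns of figure 6.6-B.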


The ordering with endo moves is called Chase's sequence. Figure 6.6-B was created with the programs
[FXT: comb/combination-enup-demo.cc] and [FXT: comb/combination-endo-demo.cc].

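The underlying recursion for the list U(n,k) of combinations $\binom{n}{k}$ in enup order is

$$
U(n,k) \;=\;
\begin{array}{l} [(n)\,.\,(n-1)\,.\,U(n-2,\,k-2)] \\ [(n)\,.\,U(n-2,\,k-1)] \\ [U^{R}(n-1,\,k)] \end{array}
\;=\;
\begin{array}{l} [1\,1\,.\,U(n-2,\,k-2)] \\ [1\,0\,.\,U(n-2,\,k-1)] \\ [0\,.\,U^{R}(n-1,\,k)] \end{array}
\qquad (6.6\text{-}1)
$$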

The recursion is very similar to relation 6.5-1 on page 184. The crucial part of the recursive routine is
[FXT: comb/combination-enup-rec-demo.cc]:

    void
    comb_enup(ulong n, ulong k, bool z)
    {
        if ( k==n )  { visit();  return; }

        if ( z )  // forward:
        {
            if ( (n>=2) && (k>=2) )  { rv[k] = n;  rv[k-1] = n-1;  comb_enup(n-2, k-2, z); }
            if ( (n>=2) && (k>=1) )  { rv[k] = n;  comb_enup(n-2, k-1, z); }
            if ( (n>=1) )  { comb_enup(n-1, k, !z); }
        }
        else  // backward:
        {
            if ( (n>=1) )  { comb_enup(n-1, k, !z); }
            if ( (n>=2) && (k>=1) )  { rv[k] = n;  comb_enup(n-2, k-1, z); }
            if ( (n>=2) && (k>=2) )  { rv[k] = n;  rv[k-1] = n-1;  comb_enup(n-2, k-2, z); }
        }
    }

A recursion for the complemented sequence (with respect to the delta sets) is

$$
U'(n,k) \;=\;
\begin{array}{l} [(n)\,.\,U'^{R}(n-1,\,k-1)] \\ [(n-1)\,.\,U'(n-2,\,k-1)] \\ [U'(n-2,\,k)] \end{array}
\;=\;
\begin{array}{l} [1\,.\,U'^{R}(n-1,\,k-1)] \\ [0\,1\,.\,U'(n-2,\,k-1)] \\ [0\,0\,.\,U'(n-2,\,k)] \end{array}
\qquad (6.6\text{-}2)
$$

The condition for the recursion end has to be modified:

    void
    comb_enup_compl(ulong n, ulong k, bool z)
    {
        if ( (k==0) || (k==n) )  { visit();  return; }

        if ( z )  // forward:
        {
            if ( (n>=1) && (k>=1) )  { rv[k] = n;  comb_enup_compl(n-1, k-1, !z); }  // 1
            if ( (n>=2) && (k>=1) )  { rv[k] = n-1;  comb_enup_compl(n-2, k-1, z); }  // 01
            if ( (n>=2) )  { comb_enup_compl(n-2, k-0, z); }  // 00
        }
        else  // backward:
        {
            if ( (n>=2) )  { comb_enup_compl(n-2, k-0, z); }  // 00
            if ( (n>=2) && (k>=1) )  { rv[k] = n-1;  comb_enup_compl(n-2, k-1, z); }  // 01
            if ( (n>=1) && (k>=1) )  { rv[k] = n;  comb_enup_compl(n-1, k-1, !z); }  // 1
        }
    }

An algorithm for Chase's sequence that generates delta sets is described in [Knuth, alg.C, sect.7.2.1.3], an
implementation is given in [FXT: class combination_chase in comb/combination-chase.h]. The routine
generates about 80 million combinations per second for both $\binom{32}{20}$ and $\binom{32}{12}$
[FXT: comb/combination-chase-demo.cc].


6.7  Recursive  generation  of  certain  orderings 

We give a simple recursive routine to generate the orders shown in figure 6.7-A. The combinations are
generated as sets [FXT: class comb_rec in comb/combination-rec.h]:

    class comb_rec
    {
    public:
        ulong n_, k_;  // (n choose k)
        ulong *rv_;    // combination: k elements 0<=x[j]<n in increasing order
        // == Record of Visits in graph
        ulong rq_;  // condition that determines the order:
        //  0 ==> lexicographic order
        //  1 ==> Gray code
        //  2 ==> complemented enup order
        //  3 ==> complemented Eades-McKay sequence
        ulong nq_;  // whether to reverse order
      [--snip--]
        void (*visit_)(const comb_rec &);  // function to call with each combination
      [--snip--]

        void generate(void (*visit)(const comb_rec &), ulong rq, ulong nq=0)
        {
            visit_ = visit;
            rq_ = rq;
            nq_ = nq;
            ct_ = 0;
            rct_ = 0;
            next_rec(0);
        }

         lexicographic   Gray code   compl. enup   compl. Eades-McKay
     1:     111....       1....11      1....11         111....
     2:     11.1...       1...11.      1...1.1         11.1...
     3:     11..1..       1...1.1      1...11.         11..1..
     4:     11...1.       1..11..      1..11..         11...1.
     5:     11....1       1..1.1.      1..1.1.         11....1
     6:     1.11...       1..1..1      1..1..1         1.1...1
     7:     1.1.1..       1.11...      1.1...1         1.1..1.
     8:     1.1..1.       1.1.1..      1.1..1.         1.1.1..
     9:     1.1...1       1.1..1.      1.1.1..         1.11...
    10:     1..11..       1.1...1      1.11...         1..11..
    11:     1..1.1.       111....      111....         1..1.1.
    12:     1..1..1       11.1...      11.1...         1..1..1
    13:     1...11.       11..1..      11..1..         1...1.1
    14:     1...1.1       11...1.      11...1.         1...11.
    15:     1....11       11....1      11....1         1....11
    16:     .111...       .1...11      .11...1         .1...11
    17:     .11.1..       .1..11.      .11..1.         .1..1.1
    18:     .11..1.       .1..1.1      .11.1..         .1..11.
    19:     .11...1       .1.11..      .111...         .1.11..
    20:     .1.11..       .1.1.1.      .1.11..         .1.1.1.
    21:     .1.1.1.       .1.1..1      .1.1.1.         .1.1..1
    22:     .1.1..1       .111...      .1.1..1         .11...1
    23:     .1..11.       .11.1..      .1..1.1         .11..1.
    24:     .1..1.1       .11..1.      .1..11.         .11.1..
    25:     .1...11       .11...1      .1...11         .111...
    26:     ..111..       ..1..11      ..1..11         ..111..
    27:     ..11.1.       ..1.11.      ..1.1.1         ..11.1.
    28:     ..11..1       ..1.1.1      ..1.11.         ..11..1
    29:     ..1.11.       ..111..      ..111..         ..1.1.1
    30:     ..1.1.1       ..11.1.      ..11.1.         ..1.11.
    31:     ..1..11       ..11..1      ..11..1         ..1..11
    32:     ...111.       ...1.11      ...11.1         ...1.11
    33:     ...11.1       ...111.      ...111.         ...11.1
    34:     ...1.11       ...11.1      ...1.11         ...111.
    35:     ....111       ....111      ....111         ....111

Figure 6.7-A: All combinations $\binom{7}{3}$ in lexicographic, minimal-change, complemented enup, and
complemented Eades-McKay order (from left to right).

The recursion function is given in [FXT: comb/combination-rec.cc]:

    void comb_rec::next_rec(ulong d)
    {
        ulong r = k_ - d;  // number of elements remaining
        if ( 0==r )  visit_(*this);
        else
        {
            ulong rv1 = rv_[d-1];  // left neighbor
            bool q;
            switch ( rq_ )
            {
            case 0:  q = 1;  break;            // 0 ==> lexicographic order
            case 1:  q = !(d&1);  break;       // 1 ==> Gray code
            case 2:  q = rv1&1;  break;        // 2 ==> complemented enup order
            case 3:  q = (d^rv1)&1;  break;    // 3 ==> complemented Eades-McKay sequence
            default: q = 1;
            }
            q ^= nq_;  // reversed order if nq == true

            if ( q )  // forward:
                for (ulong x=rv1+1; x<=n_-r; ++x)  { rv_[d] = x;  next_rec(d+1); }
            else      // backward:
                for (ulong x=n_-r; (long)x>=(long)rv1+1; --x)  { rv_[d] = x;  next_rec(d+1); }
        }
    }
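The effect of the condition `rq` can be explored with a stand-alone variant of the recursion. This is a sketch under assumptions, not the FXT class: the names `rec` and `comb_rec_order` are hypothetical, the combinations are collected in a vector, and the left neighbor at depth zero is modeled by a sentinel with all bits set stored in `rv[0]`.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> Comb;

// rv[0] is the sentinel, the elements of the combination are in rv[1..k].
static void rec(ulong n, ulong k, ulong rq, bool nq, ulong d,
                Comb &rv, std::vector<Comb> &out)
{
    const ulong r = k - d;  // number of elements remaining
    if ( 0==r )  { out.push_back( Comb(rv.begin()+1, rv.end()) );  return; }

    const ulong rv1 = rv[d];  // left neighbor (sentinel for d==0)
    bool q;
    switch ( rq )
    {
    case 0:  q = true;  break;         // lexicographic order
    case 1:  q = !(d&1);  break;       // Gray code
    case 2:  q = rv1&1;  break;        // complemented enup order
    case 3:  q = (d^rv1)&1;  break;    // complemented Eades-McKay sequence
    default: q = true;
    }
    if ( nq )  q = !q;  // reversed order

    const ulong lo = rv1 + 1;  // wraps around to 0 for the sentinel
    const ulong hi = n - r;
    if ( q )  // forward:
        for (ulong x=lo; x<=hi; ++x)  { rv[d+1] = x;  rec(n, k, rq, nq, d+1, rv, out); }
    else      // backward (written so that the unsigned x never goes below lo):
        for (ulong x=hi+1; x>lo; )  { --x;  rv[d+1] = x;  rec(n, k, rq, nq, d+1, rv, out); }
}

std::vector<Comb> comb_rec_order(ulong n, ulong k, ulong rq, bool nq=false)
{
    assert( (k>=1) && (k<=n) );
    Comb rv(k+1, 0);
    rv[0] = ~0UL;  // sentinel: all bits set
    std::vector<Comb> out;
    rec(n, k, rq, nq, 0, rv, out);
    return out;
}
```

The values `rq` = 0, 1, 2, 3 with n = 7 and k = 3 should give the four columns of figure 6.7-A. The only difference between the orders is the direction in which each level scans its range.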

Figure 6.7-A was created with the program [FXT: comb/combination-rec-demo.cc]. The routine generates
the combinations $\binom{32}{20}$ at a rate of about 35 million objects per second. The combinations $\binom{32}{12}$ are
generated at a rate of 64 million objects per second.
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Chapter  7 

Compositions 


The compositions of n into at most k parts are the ordered tuples (x_0, x_1, ..., x_{k-1}) where
x_0 + x_1 + ... + x_{k-1} = n and 0 <= x_i <= n. Order matters: one 4-composition of 7 is (0, 1, 5, 1), different ones are
(5, 0, 1, 1) and (0, 5, 1, 1). The compositions of n into at most k parts are also called 'k-compositions of
n'. To obtain the compositions of n into exactly k parts (where k <= n) generate the compositions of n-k
into k parts and add one to each position.
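The reduction from 'exactly k parts' to 'at most k parts' can be sketched as follows. The function names are hypothetical, and the compositions are produced by a plain recursion over the positions, so the order is not the co-lexicographic one treated in the next section.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> Comp;

// Fill positions pos..k-1 of x with all ways to distribute n.
static void at_most_rec(ulong n, ulong k, Comp &x, ulong pos, std::vector<Comp> &out)
{
    if ( pos==k-1 )  { x[pos] = n;  out.push_back(x);  return; }
    for (ulong v=0; v<=n; ++v)  { x[pos] = v;  at_most_rec(n-v, k, x, pos+1, out); }
}

// Compositions of n into at most k parts: k entries >= 0 with sum n.
std::vector<Comp> compositions_at_most(ulong n, ulong k)
{
    assert( k>=1 );
    Comp x(k, 0);
    std::vector<Comp> out;
    at_most_rec(n, k, x, 0, out);
    return out;
}

// Compositions of n into exactly k parts: k entries >= 1 with sum n.
std::vector<Comp> compositions_exact(ulong n, ulong k)
{
    assert( (k>=1) && (k<=n) );
    std::vector<Comp> out = compositions_at_most(n-k, k);
    for (Comp &c : out)  for (ulong &v : c)  ++v;  // add one to each position
    return out;
}
```

For example, `compositions_exact(5, 3)` returns the six tuples (1,1,3), (1,2,2), (1,3,1), (2,1,2), (2,2,1), (3,1,1).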

7.1  Co-lexicographic  order 


          composition    chg  combination          composition  chg  combination
     1:  [ 3 . . . . ]    4    111....        1:   [ 7 . . ]     2    1111111..
     2:  [ 2 1 . . . ]    1    11.1...        2:   [ 6 1 . ]     1    111111.1.
     3:  [ 1 2 . . . ]    1    1.11...        3:   [ 5 2 . ]     1    11111.11.
     4:  [ . 3 . . . ]    1    .111...        4:   [ 4 3 . ]     1    1111.111.
     5:  [ 2 . 1 . . ]    2    11..1..        5:   [ 3 4 . ]     1    111.1111.
     6:  [ 1 1 1 . . ]    1    1.1.1..        6:   [ 2 5 . ]     1    11.11111.
     7:  [ . 2 1 . . ]    1    .11.1..        7:   [ 1 6 . ]     1    1.111111.
     8:  [ 1 . 2 . . ]    2    1..11..        8:   [ . 7 . ]     1    .1111111.
     9:  [ . 1 2 . . ]    1    .1.11..        9:   [ 6 . 1 ]     2    111111..1
    10:  [ . . 3 . . ]    2    ..111..       10:   [ 5 1 1 ]     1    11111.1.1
    11:  [ 2 . . 1 . ]    3    11...1.       11:   [ 4 2 1 ]     1    1111.11.1
    12:  [ 1 1 . 1 . ]    1    1.1..1.       12:   [ 3 3 1 ]     1    111.111.1
    13:  [ . 2 . 1 . ]    1    .11..1.       13:   [ 2 4 1 ]     1    11.1111.1
    14:  [ 1 . 1 1 . ]    2    1..1.1.       14:   [ 1 5 1 ]     1    1.11111.1
    15:  [ . 1 1 1 . ]    1    .1.1.1.       15:   [ . 6 1 ]     1    .111111.1
    16:  [ . . 2 1 . ]    2    ..11.1.       16:   [ 5 . 2 ]     2    11111..11
    17:  [ 1 . . 2 . ]    3    1...11.       17:   [ 4 1 2 ]     1    1111.1.11
    18:  [ . 1 . 2 . ]    1    .1..11.       18:   [ 3 2 2 ]     1    111.11.11
    19:  [ . . 1 2 . ]    2    ..1.11.       19:   [ 2 3 2 ]     1    11.111.11
    20:  [ . . . 3 . ]    3    ...111.       20:   [ 1 4 2 ]     1    1.1111.11
    21:  [ 2 . . . 1 ]    4    11....1       21:   [ . 5 2 ]     1    .11111.11
    22:  [ 1 1 . . 1 ]    1    1.1...1       22:   [ 4 . 3 ]     2    1111..111
    23:  [ . 2 . . 1 ]    1    .11...1       23:   [ 3 1 3 ]     1    111.1.111
    24:  [ 1 . 1 . 1 ]    2    1..1..1       24:   [ 2 2 3 ]     1    11.11.111
    25:  [ . 1 1 . 1 ]    1    .1.1..1       25:   [ 1 3 3 ]     1    1.111.111
    26:  [ . . 2 . 1 ]    2    ..11..1       26:   [ . 4 3 ]     1    .1111.111
    27:  [ 1 . . 1 1 ]    3    1...1.1       27:   [ 3 . 4 ]     2    111..1111
    28:  [ . 1 . 1 1 ]    1    .1..1.1       28:   [ 2 1 4 ]     1    11.1.1111
    29:  [ . . 1 1 1 ]    2    ..1.1.1       29:   [ 1 2 4 ]     1    1.11.1111
    30:  [ . . . 2 1 ]    3    ...11.1       30:   [ . 3 4 ]     1    .111.1111
    31:  [ 1 . . . 2 ]    4    1....11       31:   [ 2 . 5 ]     2    11..11111
    32:  [ . 1 . . 2 ]    1    .1...11       32:   [ 1 1 5 ]     1    1.1.11111
    33:  [ . . 1 . 2 ]    2    ..1..11       33:   [ . 2 5 ]     1    .11.11111
    34:  [ . . . 1 2 ]    3    ...1.11       34:   [ 1 . 6 ]     2    1..111111
    35:  [ . . . . 3 ]    4    ....111       35:   [ . 1 6 ]     1    .1.111111
                                             36:   [ . . 7 ]     2    ..1111111

Figure 7.1-A: The compositions of 3 into 5 parts in co-lexicographic order, positions of the rightmost
change, and delta sets of the corresponding combinations (left); and the corresponding data for
compositions of 7 into 3 parts (right). Dots denote zeros.

The compositions in co-lexicographic (colex) order are shown in figure 7.1-A. The generator is
implemented as [FXT: class composition_colex in comb/composition-colex.h]:

    class composition_colex
    {
    public:
        ulong n_, k_;  // composition of n into k parts
        ulong *x_;     // data (k elements)
      [--snip--]

        void first()
        {
            x_[0] = n_;  // all in first position
            for (ulong k=1; k<k_; ++k)  x_[k] = 0;
        }

        void last()
        {
            for (ulong k=0; k<k_; ++k)  x_[k] = 0;
            x_[k_-1] = n_;  // all in last position
        }
      [--snip--]

The methods to compute the successor and predecessor are:

        ulong next()
        // Return position of rightmost change, return k with last composition.
        {
            ulong j = 0;
            while ( 0==x_[j] )  ++j;    // find first nonzero

            if ( j==k_-1 )  return k_;  // current composition is last

            ulong v = x_[j];  // value of first nonzero
            x_[j] = 0;        // set to zero
            x_[0] = v - 1;    // value-1 to first position
            ++j;
            ++x_[j];          // increment next position

            return j;
        }

        ulong prev()
        // Return position of rightmost change, return k with first composition.
        {
            const ulong v = x_[0];  // value at first position

            if ( n_==v )  return k_;  // current composition is first

            x_[0] = 0;  // set first position to zero
            ulong j = 1;
            while ( 0==x_[j] )  ++j;  // find next nonzero
            --x_[j];          // decrement value
            x_[j-1] = 1 + v;  // set previous position

            return j;
        }

With each transition at most 3 entries are changed. The compositions of 10 into 30 parts (sparse case) are generated at a rate of about 110 million per second, the compositions of 30 into 10 parts (dense case) at about 200 million per second [FXT: comb/composition-colex-demo.cc]. In the dense case (corresponding to the right of figure 7.1-A) the computation is faster because the position to change is found earlier.
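
The successor rule can be tried out in isolation. The following is a minimal sketch, not the FXT class itself: the names colex_next and all_compositions_colex are hypothetical, std::vector replaces the raw pointer, and an explicit bounds check is added to the search loop so that the all-zero composition (n = 0) is handled as well.

```cpp
#include <cassert>
#include <vector>

using ulong = unsigned long;

// Hypothetical helper (not part of FXT): successor in colex order.
// x holds a composition of n into k = x.size() parts.
// Returns the position of the rightmost change, or k if x was the last one.
inline ulong colex_next(std::vector<ulong> &x)
{
    assert( !x.empty() );
    const ulong k = x.size();
    ulong j = 0;
    while ( j < k-1 && x[j] == 0 )  ++j;  // find first nonzero (bounded)
    if ( j == k-1 )  return k;            // all weight in last part: done
    const ulong v = x[j];
    x[j] = 0;        // set to zero
    x[0] = v - 1;    // value-1 to first position
    ++j;
    ++x[j];          // increment next position
    return j;
}

// Collect all compositions of n into k parts, starting with [n,0,...,0].
inline std::vector<std::vector<ulong>> all_compositions_colex(ulong n, ulong k)
{
    std::vector<ulong> x(k, 0);
    x[0] = n;
    std::vector<std::vector<ulong>> out;
    do { out.push_back(x); } while ( colex_next(x) != k );
    return out;
}
```

Calling all_compositions_colex(3, 5) should reproduce the left column of figure 7.1-A. Storing all compositions is only done here for inspection; a generator would normally visit each one in place.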


Optimized  implementation 

An implementation that is efficient also for the sparse case (that is, k much greater than n) is [FXT: class composition_colex2 in comb/composition-colex2.h]. One additional variable p0_ records the position of the first nonzero entry. The method to compute the successor is:

    class composition_colex2
    {
      [--snip--]
        ulong next()
        // Return position of rightmost change, return k with last composition.
        {
            ulong j = p0_;  // position of first nonzero

            if ( j==k_-1 )  return k_;  // current composition is last

            ulong v = x_[j];  // value of first nonzero
            x_[j] = 0;        // set to zero
            --v;
            x_[0] = v;        // value-1 to first position

            ++p0_;  // first nonzero one more right except ...
            if ( 0!=v )  p0_ = 0;  // ... if value v was not one

            ++j;
            ++x_[j];  // increment next position

            return j;
        }
    };


About 270 million compositions are generated per second, independent of n and k [FXT: comb/composition-colex2-demo.cc]. With the line

    #define COMP_COLEX2_MAX_ARRAY_LEN 128

just before the class definition an array is used instead of a pointer. The fixed array length limits the value of k, so by default the line is commented out. Using an array gives a significant speedup: the rate is about 365 million per second (about 6 CPU cycles per update).
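
To see why tracking the first nonzero position removes the search loop, here is a small sketch under stated assumptions: the struct name colex2_sketch and its members x and p0 are hypothetical, and a std::vector is used in place of the pointer or fixed array. After each update either x[0] is nonzero (so the first nonzero is at position 0) or the old first nonzero entry was one, in which case the first nonzero moves one position to the right.

```cpp
#include <cassert>
#include <vector>

using ulong = unsigned long;

// Hypothetical sketch of a colex generator without a search loop.
struct colex2_sketch
{
    std::vector<ulong> x;  // composition of n into k parts
    ulong p0;              // position of the first nonzero entry

    colex2_sketch(ulong n, ulong k) : x(k, 0), p0(0)
    {
        assert( k >= 1 );
        x[0] = n;
        if ( n == 0 )  p0 = k - 1;  // treat the all-zero composition as last
    }

    // Returns position of rightmost change, or k with the last composition.
    ulong next()
    {
        const ulong k = x.size();
        ulong j = p0;
        if ( j == k-1 )  return k;
        ulong v = x[j];
        x[j] = 0;
        --v;
        x[0] = v;              // value-1 to first position
        ++p0;                  // first nonzero moves right ...
        if ( v != 0 )  p0 = 0; // ... unless x[0] is now nonzero
        ++j;
        ++x[j];
        return j;
    }
};
```

The work per call is constant, which is consistent with the observation that the rate does not depend on n and k.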


7.2  Co-lexicographic  order  for  compositions  into  exactly  k parts 


The compositions of n into exactly k parts (where k ≤ n) can be obtained from the compositions of n − k into at most k parts as shown in figure 7.2-A. The listing was created with the program [FXT: comb/composition-ex-colex-demo.cc]. The compositions can be generated in co-lexicographic order using [FXT: class composition_ex_colex in comb/composition-ex-colex.h]:

    class composition_ex_colex
    {
    public:
        ulong n_, k_;  // composition of n into exactly k parts
        ulong *x_;     // data (k elements)
        ulong nk1_;    // ==n-k+1

    public:
        composition_ex_colex(ulong n, ulong k)
        // Must have n>=k
        {
            n_ = n;
            k_ = k;
            nk1_ = n - k + 1;  // must be >= 1
            if ( (long)nk1_ < 1 )  nk1_ = 1;  // avoid hang with invalid pair n,k
            x_ = new ulong[k_ + 1];
            x_[k] = 0;  // not one
            first();
        }
      [--snip--]


The variable nk1_ is the maximal entry in the compositions:

        void first()
        {
            x_[0] = nk1_;  // all in first position
            for (ulong k=1; k<k_; ++k)  x_[k] = 1;
        }

        void last()
        {
            for (ulong k=0; k<k_; ++k)  x_[k] = 1;
            x_[k_-1] = nk1_;  // all in last position
        }


         exact comp.      chg    composition
   1:   [ 4 1 1 1 1 ]      4    [ 3 . . . . ]
   2:   [ 3 2 1 1 1 ]      1    [ 2 1 . . . ]
   3:   [ 2 3 1 1 1 ]      1    [ 1 2 . . . ]
   4:   [ 1 4 1 1 1 ]      1    [ . 3 . . . ]
   5:   [ 3 1 2 1 1 ]      2    [ 2 . 1 . . ]
   6:   [ 2 2 2 1 1 ]      1    [ 1 1 1 . . ]
   7:   [ 1 3 2 1 1 ]      1    [ . 2 1 . . ]
   8:   [ 2 1 3 1 1 ]      2    [ 1 . 2 . . ]
   9:   [ 1 2 3 1 1 ]      1    [ . 1 2 . . ]
  10:   [ 1 1 4 1 1 ]      2    [ . . 3 . . ]
  11:   [ 3 1 1 2 1 ]      3    [ 2 . . 1 . ]
  12:   [ 2 2 1 2 1 ]      1    [ 1 1 . 1 . ]
  13:   [ 1 3 1 2 1 ]      1    [ . 2 . 1 . ]
  14:   [ 2 1 2 2 1 ]      2    [ 1 . 1 1 . ]
  15:   [ 1 2 2 2 1 ]      1    [ . 1 1 1 . ]
  16:   [ 1 1 3 2 1 ]      2    [ . . 2 1 . ]
  17:   [ 2 1 1 3 1 ]      3    [ 1 . . 2 . ]
  18:   [ 1 2 1 3 1 ]      1    [ . 1 . 2 . ]
  19:   [ 1 1 2 3 1 ]      2    [ . . 1 2 . ]
  20:   [ 1 1 1 4 1 ]      3    [ . . . 3 . ]
  21:   [ 3 1 1 1 2 ]      4    [ 2 . . . 1 ]
  22:   [ 2 2 1 1 2 ]      1    [ 1 1 . . 1 ]
  23:   [ 1 3 1 1 2 ]      1    [ . 2 . . 1 ]
  24:   [ 2 1 2 1 2 ]      2    [ 1 . 1 . 1 ]
  25:   [ 1 2 2 1 2 ]      1    [ . 1 1 . 1 ]
  26:   [ 1 1 3 1 2 ]      2    [ . . 2 . 1 ]
  27:   [ 2 1 1 2 2 ]      3    [ 1 . . 1 1 ]
  28:   [ 1 2 1 2 2 ]      1    [ . 1 . 1 1 ]
  29:   [ 1 1 2 2 2 ]      2    [ . . 1 1 1 ]
  30:   [ 1 1 1 3 2 ]      3    [ . . . 2 1 ]
  31:   [ 2 1 1 1 3 ]      4    [ 1 . . . 2 ]
  32:   [ 1 2 1 1 3 ]      1    [ . 1 . . 2 ]
  33:   [ 1 1 2 1 3 ]      2    [ . . 1 . 2 ]
  34:   [ 1 1 1 2 3 ]      3    [ . . . 1 2 ]
  35:   [ 1 1 1 1 4 ]      4    [ . . . . 3 ]

Figure 7.2-A: The compositions of n = 8 into exactly k = 5 parts (left) are obtained from the compositions of n − k = 3 into at most k = 5 parts (right). Co-lexicographic order. Dots denote zeros.


The methods for computing the successor and predecessor are adaptations of the routines for the compositions into at most k parts:

        ulong next()
        // Return position of rightmost change, return k with last composition.
        {
            ulong j = 0;
            while ( 1==x_[j] )  ++j;  // find first greater than one

            if ( j==k_ )  return k_;  // current composition is last

            ulong v = x_[j];  // value of first greater one
            x_[j] = 1;        // set to 1
            x_[0] = v - 1;    // value-1 to first position
            ++j;
            ++x_[j];          // increment next position

            return j;
        }

        ulong prev()
        // Return position of rightmost change, return k with first composition.
        {
            const ulong v = x_[0];  // value at first position

            if ( nk1_==v )  return k_;  // current composition is first

            x_[0] = 1;  // set first position to 1
            ulong j = 1;
            while ( 1==x_[j] )  ++j;  // find next greater than one
            --x_[j];          // decrement value
            x_[j-1] = 1 + v;  // set previous position

            return j;
        }
    };


The  routines  are  as  fast  as  the  generation  into  at  most  k parts  with  the  corresponding  parameters:  the 
compositions  of  40  into  10  parts  are  generated  at  a rate  of  about  200  million  per  second. 
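
The relation between the two kinds of compositions can be checked with a short sketch. It is an illustration under assumptions, not the FXT class: the function name compositions_exact_colex is hypothetical, the result is collected in a std::vector, and the termination test uses an explicit bound instead of a sentinel entry.

```cpp
#include <cassert>
#include <vector>

using ulong = unsigned long;

// Hypothetical helper: all compositions of n into exactly k parts
// (n >= k >= 1) in co-lexicographic order.
inline std::vector<std::vector<ulong>> compositions_exact_colex(ulong n, ulong k)
{
    assert( k >= 1 && n >= k );
    std::vector<ulong> x(k, 1);
    x[0] = n - k + 1;  // maximal entry, all excess in first position
    std::vector<std::vector<ulong>> out;
    while ( true )
    {
        out.push_back(x);
        ulong j = 0;
        while ( j < k && x[j] == 1 )  ++j;  // find first entry greater than one
        if ( j >= k-1 )  break;             // all excess in last part (or none)
        const ulong v = x[j];
        x[j] = 1;       // set to 1
        x[0] = v - 1;   // value-1 to first position
        ++x[j+1];       // increment next position
    }
    return out;
}
```

Subtracting one from every entry of each result gives the compositions of n − k into at most k parts in the same order, which is the correspondence shown in figure 7.2-A.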


7.3  Compositions  and  combinations 


        combination    delta set    composition
   1:   [ 0 1 2 ]      111...       [ 3 . . . ]
   2:   [ 0 2 3 ]      1.11..       [ 1 2 . . ]
   3:   [ 1 2 3 ]      .111..       [ . 3 . . ]
   4:   [ 0 1 3 ]      11.1..       [ 2 1 . . ]
   5:   [ 0 3 4 ]      1..11.       [ 1 . 2 . ]
   6:   [ 1 3 4 ]      .1.11.       [ . 1 2 . ]
   7:   [ 2 3 4 ]      ..111.       [ . . 3 . ]
   8:   [ 0 2 4 ]      1.1.1.       [ 1 1 1 . ]
   9:   [ 1 2 4 ]      .11.1.       [ . 2 1 . ]
  10:   [ 0 1 4 ]      11..1.       [ 2 . 1 . ]
  11:   [ 0 4 5 ]      1...11       [ 1 . . 2 ]
  12:   [ 1 4 5 ]      .1..11       [ . 1 . 2 ]
  13:   [ 2 4 5 ]      ..1.11       [ . . 1 2 ]
  14:   [ 3 4 5 ]      ...111       [ . . . 3 ]
  15:   [ 0 3 5 ]      1..1.1       [ 1 . 1 1 ]
  16:   [ 1 3 5 ]      .1.1.1       [ . 1 1 1 ]
  17:   [ 2 3 5 ]      ..11.1       [ . . 2 1 ]
  18:   [ 0 2 5 ]      1.1..1       [ 1 1 . 1 ]
  19:   [ 1 2 5 ]      .11..1       [ . 2 . 1 ]
  20:   [ 0 1 5 ]      11...1       [ 2 . . 1 ]

Figure 7.3-A: Combinations 6 choose 3 (left) and the corresponding compositions of 3 into 4 parts (right). The sequence of combinations is a Gray code but the sequence of compositions is not.


Figure 7.3-A shows the correspondence between compositions and combinations. The listing was generated using the program [FXT: comb/comb2comp-demo.cc]. Entries in the left column are combinations of 3 parts out of 6. The middle column is the representation of the combinations as delta sets. It also is a binary representation of a composition: a run of r consecutive ones corresponds to an entry r in the composition at the right.

Now write P(n, k) for the compositions of n into (at most) k parts and B(N, K) for the combination \binom{N}{K}: A composition of n into at most k parts corresponds to a combination of K = n parts from N = n + k − 1 elements, symbolically:

    \[ P(n, k) \;\leftrightarrow\; B(N, K) \;=\; B(n + k - 1,\, n) \qquad (7.3\text{-}1a) \]

A combination of K elements out of N corresponds to a composition of n into at most k parts where n = K and k = N − K + 1:

    \[ B(N, K) \;\leftrightarrow\; P(n, k) \;=\; P(K,\, N - K + 1) \qquad (7.3\text{-}1b) \]

We give routines for the conversion between combinations and compositions. The following routine converts a composition into the corresponding combination [FXT: comb/comp2comb.h]:

    inline void comp2comb(const ulong *p, ulong k, ulong *b)
    // Convert composition P(*, k) in p[] to combination in b[]
    {
        for (ulong j=0,i=0,z=0; j<k; ++j)
        {
            ulong pj = p[j];
            for (ulong w=0; w<pj; ++w)  b[i++] = z++;
            ++z;
        }
    }

The conversion of a combination into the corresponding composition can be implemented as

    inline void comb2comp(const ulong *b, ulong N, ulong K, ulong *p)
    // Convert combination B(N, K) in b[] to composition P(*,k) in p[]
    // Must have: K>0
    {
        ulong k = N-K+1;
        for (ulong z=0; z<k; ++z)  p[z] = 0;
        --k;
        ulong c1 = N;
        while ( K-- )
        {
            ulong c0 = b[K];
            ulong d = c1 - c0;
            k -= (d-1);
            ++p[k];
            c1 = c0;
        }
    }
7.4  Minimal-change  orders 


        composition       combination                    composition       combination
   1:   [ . . . 3 . ]   ...111.  [ 3 4 5 ]          1:   [ 3 . . . . ]   111....  [ 0 1 2 ]
   2:   [ . 1 . 2 . ]   .1..11.  [ 1 4 5 ]          2:   [ 2 1 . . . ]   11.1...  [ 0 1 3 ]
   3:   [ 1 . . 2 . ]   1...11.  [ 0 4 5 ]          3:   [ 1 2 . . . ]   1.11...  [ 0 2 3 ]
   4:   [ . . 1 2 . ]   ..1.11.  [ 2 4 5 ]          4:   [ . 3 . . . ]   .111...  [ 1 2 3 ]
   5:   [ . . 2 1 . ]   ..11.1.  [ 2 3 5 ]          5:   [ . 2 1 . . ]   .11.1..  [ 1 2 4 ]
   6:   [ . 1 1 1 . ]   .1.1.1.  [ 1 3 5 ]          6:   [ 1 1 1 . . ]   1.1.1..  [ 0 2 4 ]
   7:   [ 1 . 1 1 . ]   1..1.1.  [ 0 3 5 ]          7:   [ 2 . 1 . . ]   11..1..  [ 0 1 4 ]
   8:   [ 2 . . 1 . ]   11...1.  [ 0 1 5 ]          8:   [ 1 . 2 . . ]   1..11..  [ 0 3 4 ]
   9:   [ 1 1 . 1 . ]   1.1..1.  [ 0 2 5 ]          9:   [ . 1 2 . . ]   .1.11..  [ 1 3 4 ]
  10:   [ . 2 . 1 . ]   .11..1.  [ 1 2 5 ]         10:   [ . . 3 . . ]   ..111..  [ 2 3 4 ]
  11:   [ . 3 . . . ]   .111...  [ 1 2 3 ]         11:   [ . . 2 1 . ]   ..11.1.  [ 2 3 5 ]
  12:   [ 1 2 . . . ]   1.11...  [ 0 2 3 ]         12:   [ 1 . 1 1 . ]   1..1.1.  [ 0 3 5 ]
  13:   [ 2 1 . . . ]   11.1...  [ 0 1 3 ]         13:   [ . 1 1 1 . ]   .1.1.1.  [ 1 3 5 ]
  14:   [ 3 . . . . ]   111....  [ 0 1 2 ]         14:   [ . 2 . 1 . ]   .11..1.  [ 1 2 5 ]
  15:   [ 2 . 1 . . ]   11..1..  [ 0 1 4 ]         15:   [ 1 1 . 1 . ]   1.1..1.  [ 0 2 5 ]
  16:   [ 1 1 1 . . ]   1.1.1..  [ 0 2 4 ]         16:   [ 2 . . 1 . ]   11...1.  [ 0 1 5 ]
  17:   [ . 2 1 . . ]   .11.1..  [ 1 2 4 ]         17:   [ 1 . . 2 . ]   1...11.  [ 0 4 5 ]
  18:   [ . 1 2 . . ]   .1.11..  [ 1 3 4 ]         18:   [ . 1 . 2 . ]   .1..11.  [ 1 4 5 ]
  19:   [ 1 . 2 . . ]   1..11..  [ 0 3 4 ]         19:   [ . . 1 2 . ]   ..1.11.  [ 2 4 5 ]
  20:   [ . . 3 . . ]   ..111..  [ 2 3 4 ]         20:   [ . . . 3 . ]   ...111.  [ 3 4 5 ]
  21:   [ . . 2 . 1 ]   ..11..1  [ 2 3 6 ]         21:   [ . . . 2 1 ]   ...11.1  [ 3 4 6 ]
  22:   [ . 1 1 . 1 ]   .1.1..1  [ 1 3 6 ]         22:   [ 1 . . 1 1 ]   1...1.1  [ 0 4 6 ]
  23:   [ 1 . 1 . 1 ]   1..1..1  [ 0 3 6 ]         23:   [ . 1 . 1 1 ]   .1..1.1  [ 1 4 6 ]
  24:   [ 2 . . . 1 ]   11....1  [ 0 1 6 ]         24:   [ . . 1 1 1 ]   ..1.1.1  [ 2 4 6 ]
  25:   [ 1 1 . . 1 ]   1.1...1  [ 0 2 6 ]         25:   [ . . 2 . 1 ]   ..11..1  [ 2 3 6 ]
  26:   [ . 2 . . 1 ]   .11...1  [ 1 2 6 ]         26:   [ 1 . 1 . 1 ]   1..1..1  [ 0 3 6 ]
  27:   [ . 1 . 1 1 ]   .1..1.1  [ 1 4 6 ]         27:   [ . 1 1 . 1 ]   .1.1..1  [ 1 3 6 ]
  28:   [ 1 . . 1 1 ]   1...1.1  [ 0 4 6 ]         28:   [ . 2 . . 1 ]   .11...1  [ 1 2 6 ]
  29:   [ . . 1 1 1 ]   ..1.1.1  [ 2 4 6 ]         29:   [ 1 1 . . 1 ]   1.1...1  [ 0 2 6 ]
  30:   [ . . . 2 1 ]   ...11.1  [ 3 4 6 ]         30:   [ 2 . . . 1 ]   11....1  [ 0 1 6 ]
  31:   [ . . . 1 2 ]   ...1.11  [ 3 5 6 ]         31:   [ 1 . . . 2 ]   1....11  [ 0 5 6 ]
  32:   [ . 1 . . 2 ]   .1...11  [ 1 5 6 ]         32:   [ . 1 . . 2 ]   .1...11  [ 1 5 6 ]
  33:   [ 1 . . . 2 ]   1....11  [ 0 5 6 ]         33:   [ . . 1 . 2 ]   ..1..11  [ 2 5 6 ]
  34:   [ . . 1 . 2 ]   ..1..11  [ 2 5 6 ]         34:   [ . . . 1 2 ]   ...1.11  [ 3 5 6 ]
  35:   [ . . . . 3 ]   ....111  [ 4 5 6 ]         35:   [ . . . . 3 ]   ....111  [ 4 5 6 ]

Figure 7.4-A: Compositions of 3 into 5 parts and the corresponding combinations as delta sets and sets in two minimal-change orders: order with enup moves (left) and order with modulo moves (right). The ordering by enup moves is a two-close Gray code. Dots denote zeros.


        combination      composition                combination      composition
   1:   [ 0 5 6 ]  1....11  [ 1 . . . 2 ]       1:   [ 0 1 2 ]  111....  [ 3 . . . . ]
   2:   [ 0 4 6 ]  1...1.1  [ 1 . . 1 1 ]       2:   [ 0 1 3 ]  11.1...  [ 2 1 . . . ]
   3:   [ 0 4 5 ]  1...11.  [ 1 . . 2 . ]       3:   [ 0 1 4 ]  11..1..  [ 2 . 1 . . ]
   4:   [ 0 3 4 ]  1..11..  [ 1 . 2 . . ]       4:   [ 0 1 5 ]  11...1.  [ 2 . . 1 . ]
   5:   [ 0 3 5 ]  1..1.1.  [ 1 . 1 1 . ]       5:   [ 0 1 6 ]  11....1  [ 2 . . . 1 ]
   6:   [ 0 3 6 ]  1..1..1  [ 1 . 1 . 1 ]       6:   [ 0 2 6 ]  1.1...1  [ 1 1 . . 1 ]
   7:   [ 0 2 6 ]  1.1...1  [ 1 1 . . 1 ]       7:   [ 0 2 5 ]  1.1..1.  [ 1 1 . 1 . ]
   8:   [ 0 2 5 ]  1.1..1.  [ 1 1 . 1 . ]       8:   [ 0 2 4 ]  1.1.1..  [ 1 1 1 . . ]
   9:   [ 0 2 4 ]  1.1.1..  [ 1 1 1 . . ]       9:   [ 0 2 3 ]  1.11...  [ 1 2 . . . ]
  10:   [ 0 2 3 ]  1.11...  [ 1 2 . . . ]      10:   [ 0 3 4 ]  1..11..  [ 1 . 2 . . ]
  11:   [ 0 1 2 ]  111....  [ 3 . . . . ]      11:   [ 0 3 5 ]  1..1.1.  [ 1 . 1 1 . ]
  12:   [ 0 1 3 ]  11.1...  [ 2 1 . . . ]      12:   [ 0 3 6 ]  1..1..1  [ 1 . 1 . 1 ]
  13:   [ 0 1 4 ]  11..1..  [ 2 . 1 . . ]      13:   [ 0 4 6 ]  1...1.1  [ 1 . . 1 1 ]
  14:   [ 0 1 5 ]  11...1.  [ 2 . . 1 . ]      14:   [ 0 4 5 ]  1...11.  [ 1 . . 2 . ]
  15:   [ 0 1 6 ]  11....1  [ 2 . . . 1 ]      15:   [ 0 5 6 ]  1....11  [ 1 . . . 2 ]
  16:   [ 1 2 6 ]  .11...1  [ . 2 . . 1 ]      16:   [ 1 5 6 ]  .1...11  [ . 1 . . 2 ]
  17:   [ 1 2 5 ]  .11..1.  [ . 2 . 1 . ]      17:   [ 1 4 6 ]  .1..1.1  [ . 1 . 1 1 ]
  18:   [ 1 2 4 ]  .11.1..  [ . 2 1 . . ]      18:   [ 1 4 5 ]  .1..11.  [ . 1 . 2 . ]
  19:   [ 1 2 3 ]  .111...  [ . 3 . . . ]      19:   [ 1 3 4 ]  .1.11..  [ . 1 2 . . ]
  20:   [ 1 3 4 ]  .1.11..  [ . 1 2 . . ]      20:   [ 1 3 5 ]  .1.1.1.  [ . 1 1 1 . ]
      [--snip--]

Figure 7.4-B: The (reversed) complemented enup ordering (left) and Eades-McKay sequence (right) for combinations correspond to compositions where only two adjacent entries change with each transition, but by more than 1 in general.

A minimal-change order (Gray code) for compositions is such that with each transition one entry is increased by 1 and another is decreased by 1. A recursion for the compositions P(n, k) of n into k parts in lexicographic order is (notation as in relation 14.1-1 on page 304)

    \[
    P(n,k) \;=\;
    \begin{array}{l}
    [\,0 \;.\; P(n-0,\,k-1)\,] \\
    [\,1 \;.\; P(n-1,\,k-1)\,] \\
    [\,2 \;.\; P(n-2,\,k-1)\,] \\
    [\,3 \;.\; P(n-3,\,k-1)\,] \\
    [\,4 \;.\; P(n-4,\,k-1)\,] \\
    [\;\vdots\;] \\
    [\,n \;.\; P(0,\,k-1)\,]
    \end{array}
    \qquad (7.4\text{-}1)
    \]

A Gray code is obtained by changing the direction if the element is even:

    \[
    P(n,k) \;=\;
    \begin{array}{l}
    [\,0 \;.\; P^{R}(n-0,\,k-1)\,] \\
    [\,1 \;.\; P(n-1,\,k-1)\,] \\
    [\,2 \;.\; P^{R}(n-2,\,k-1)\,] \\
    [\,3 \;.\; P(n-3,\,k-1)\,] \\
    [\,4 \;.\; P^{R}(n-4,\,k-1)\,] \\
    [\;\vdots\;]
    \end{array}
    \qquad (7.4\text{-}2)
    \]
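
The recursion with direction changes at the even elements can be turned into a few lines of code. The following is a sketch under assumptions, not the FXT demo program: the name gray_compositions is hypothetical, the lists are built explicitly (which is wasteful but easy to check), and the element e of the recursion is stored as the last part so that the last part changes slowest.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using ulong = unsigned long;
using comp_list = std::vector<std::vector<ulong>>;

// Hypothetical helper: compositions of n into k parts (k >= 1) such that
// successive compositions differ by +1 in one entry and -1 in another.
inline comp_list gray_compositions(ulong n, ulong k)
{
    assert( k >= 1 );
    if ( k == 1 )  return comp_list{ std::vector<ulong>{n} };

    comp_list out;
    for (ulong e=0; e<=n; ++e)
    {
        comp_list sub = gray_compositions(n - e, k - 1);
        // change direction if the element is even:
        if ( (e & 1) == 0 )  std::reverse(sub.begin(), sub.end());
        for (auto &c : sub)  { c.push_back(e);  out.push_back(c); }
    }
    return out;
}
```

For n = 3 and k = 5 the list starts with [ . . . 3 . ] and ends with [ . . . . 3 ]. Reversing the sublists for odd e instead of even e gives the other variant of the recursion.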


The ordering is shown in figure 7.4-A (left), the corresponding combinations are in the (reversed) enup order from section 6.6.2 on page 188. Now we change directions at the odd elements:

    \[
    P(n,k) \;=\;
    \begin{array}{l}
    [\,0 \;.\; P(n-0,\,k-1)\,] \\
    [\,1 \;.\; P^{R}(n-1,\,k-1)\,] \\
    [\,2 \;.\; P(n-2,\,k-1)\,] \\
    [\,3 \;.\; P^{R}(n-3,\,k-1)\,] \\
    [\,4 \;.\; P(n-4,\,k-1)\,] \\
    [\;\vdots\;]
    \end{array}
    \qquad (7.4\text{-}3)
    \]

We get an ordering (right of figure 7.4-A) whose corresponding combinations are in the (reversed) Eades-McKay order from section 6.5 on page 183. The listings were created with the program [FXT: comb/composition-gray-rec-demo.cc].

Gray codes for compositions correspond to Gray codes for combinations where no element in the delta set crosses another. The standard Gray code for combinations does not lead to a Gray code for compositions, as shown in figure 7.3-A on page 198. If the directions in the recursions are always changed, the compositions correspond to combinations that have the complemented delta sets of the standard Gray code in reversed order.

Orderings where the changes involve just one pair of adjacent entries (shown in figure 7.4-B) correspond to the complemented strong Gray codes for combinations. The amount of change is greater than 1 in general. The listings were created with the program [FXT: comb/combination-rec-demo.cc], see section 6.7.


Chapter 8

Subsets 


We give algorithms to generate all subsets of a set of n elements. There are 2^n subsets, including the empty set. We further give methods to generate all subsets with k elements where k lies in a given range: k_min ≤ k ≤ k_max. The subsets with exactly k elements are treated in chapter 6 on page 176.

8.1  Lexicographic  order 


   1:   1....   {0}                    1....   {0}
   2:   11...   {0, 1}                 .1...   {1}
   3:   111..   {0, 1, 2}              11...   {0, 1}
   4:   1111.   {0, 1, 2, 3}           ..1..   {2}
   5:   11111   {0, 1, 2, 3, 4}        1.1..   {0, 2}
   6:   111.1   {0, 1, 2, 4}           .11..   {1, 2}
   7:   11.1.   {0, 1, 3}              111..   {0, 1, 2}
   8:   11.11   {0, 1, 3, 4}           ...1.   {3}
   9:   11..1   {0, 1, 4}              1..1.   {0, 3}
  10:   1.1..   {0, 2}                 .1.1.   {1, 3}
  11:   1.11.   {0, 2, 3}              11.1.   {0, 1, 3}
  12:   1.111   {0, 2, 3, 4}           ..11.   {2, 3}
  13:   1.1.1   {0, 2, 4}              1.11.   {0, 2, 3}
  14:   1..1.   {0, 3}                 .111.   {1, 2, 3}
  15:   1..11   {0, 3, 4}              1111.   {0, 1, 2, 3}
  16:   1...1   {0, 4}                 ....1   {4}
  17:   .1...   {1}                    1...1   {0, 4}
  18:   .11..   {1, 2}                 .1..1   {1, 4}
  19:   .111.   {1, 2, 3}              11..1   {0, 1, 4}
  20:   .1111   {1, 2, 3, 4}           ..1.1   {2, 4}
  21:   .11.1   {1, 2, 4}              1.1.1   {0, 2, 4}
  22:   .1.1.   {1, 3}                 .11.1   {1, 2, 4}
  23:   .1.11   {1, 3, 4}              111.1   {0, 1, 2, 4}
  24:   .1..1   {1, 4}                 ...11   {3, 4}
  25:   ..1..   {2}                    1..11   {0, 3, 4}
  26:   ..11.   {2, 3}                 .1.11   {1, 3, 4}
  27:   ..111   {2, 3, 4}              11.11   {0, 1, 3, 4}
  28:   ..1.1   {2, 4}                 ..111   {2, 3, 4}
  29:   ...1.   {3}                    1.111   {0, 2, 3, 4}
  30:   ...11   {3, 4}                 .1111   {1, 2, 3, 4}
  31:   ....1   {4}                    11111   {0, 1, 2, 3, 4}

Figure 8.1-A: Nonempty subsets of a 5-element set in lexicographic order for the sets (left) and in lexicographic order for the delta sets (right).


The (nonempty) subsets of a set of five elements in lexicographic order are shown in figure 8.1-A. Note that the lexicographic order with sets is different from the lexicographic order with delta sets.


8.1.1  Generation  as  delta  sets 

The listing on the right side of figure 8.1-A is with respect to the delta sets. It was created with the program [FXT: comb/subset-deltalex-demo.cc] which uses the generator [FXT: class subset_deltalex in comb/subset-deltalex.h]:

    class subset_deltalex
    {
    public:
        ulong *d_;  // subset as delta set
        ulong n_;   // subsets of the n-set {0,1,2,...,n-1}

    public:
        subset_deltalex(ulong n)
        {
            n_ = n;
            d_ = new ulong[n+1];
            d_[n] = 0;  // sentinel
            first();
        }

        ~subset_deltalex()  { delete [] d_; }

        void first()  { for (ulong k=0; k<n_; ++k)  d_[k] = 0; }


The algorithm for the computation of the successor is binary counting:

        bool next()
        {
            ulong k = 0;
            while ( d_[k]==1 )  { d_[k]=0;  ++k; }

            if ( k==n_ )  return false;  // current subset is last

            d_[k] = 1;
            return true;
        }

        const ulong * data()  const  { return d_; }

About 176 million subsets per second are generated and 192 M/s if an array is used. A bit-level algorithm to compute the subsets in lexicographic order is given in section 1.26 on page 70.
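
Binary counting on a delta set is short enough to test on its own. The sketch below is an illustration with assumed names (deltalex_next is hypothetical); it uses a std::vector and an explicit bounds check in place of the sentinel element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ulong = unsigned long;

// Hypothetical helper: advance the delta set d (entries 0 or 1) to its
// successor by binary counting with the least significant digit at d[0].
// Returns false if d was the full set; d is then reset to the empty set.
inline bool deltalex_next(std::vector<ulong> &d)
{
    std::size_t k = 0;
    while ( k < d.size() && d[k] == 1 )  { d[k] = 0;  ++k; }  // carry
    if ( k == d.size() )  return false;  // wrapped around
    assert( d[k] == 0 );
    d[k] = 1;
    return true;
}
```

Starting from the empty set of a 5-element set, 31 calls return true and visit the nonempty subsets in the order of the right column of figure 8.1-A.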


8.1.2  Generation  as  sets 

The lexicographic order with respect to the set representation is shown at the left side of figure 8.1-A. The routines in [FXT: class subset_lex in comb/subset-lex.h] compute the nonempty sets:

    class subset_lex
    {
    public:
        ulong *x_;  // subset of {0,1,2,...,n-1}
        ulong n_;   // number of elements in set
        ulong k_;   // index of last element in subset
        // Number of elements in subset == k+1

    public:
        subset_lex(ulong n)
        {
            n_ = n;
            x_ = new ulong[n_];
            first();
        }

        ~subset_lex()  { delete [] x_; }

        ulong first()
        {
            k_ = 0;
            x_[0] = 0;
            return k_ + 1;
        }

        ulong last()
        {
            k_ = 0;
            x_[0] = n_ - 1;
            return k_ + 1;
        }
      [--snip--]

The method next() computes the successor:

        ulong next()
        // Generate next subset
        // Return number of elements in subset
        // Return zero if current == last
        {
            if ( x_[k_] == n_-1 )  // last element is max ?
            {
                if ( k_==0 )  { first();  return 0; }

                --k_;      // remove last element
                x_[k_]++;  // increase last element
            }
            else  // add next element from set:
            {
                ++k_;
                x_[k_] = x_[k_-1] + 1;
            }

            return k_ + 1;
        }
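
The successor rule for the set representation has only two cases, which the following sketch makes explicit. The function name subset_lex_next is hypothetical, and unlike the method above it leaves the last subset in place instead of resetting to the first one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ulong = unsigned long;

// Hypothetical helper: successor of the nonempty subset s of {0,...,n-1}
// (elements in increasing order) in lexicographic order.
// Returns the number of elements, or 0 if s == {n-1} was the last subset.
inline std::size_t subset_lex_next(std::vector<ulong> &s, ulong n)
{
    assert( !s.empty() && n >= 1 );
    if ( s.back() == n-1 )  // last element is max ?
    {
        if ( s.size() == 1 )  return 0;  // {n-1} is the last subset
        s.pop_back();   // remove last element
        ++s.back();     // increase (new) last element
    }
    else  s.push_back( s.back() + 1 );  // add next element from set
    return s.size();
}
```

Starting with {0} for n = 5, the sixth subset visited is {0, 1, 2, 4}, in agreement with the left column of figure 8.1-A.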


Computation of the predecessor:

        ulong prev()
        // Generate previous subset
        // Return number of elements in subset
        // Return zero if current == first
        {
            if ( k_ == 0 )  // only one element ?
            {
                if ( x_[0]==0 )  { last();  return 0; }

                x_[0]--;            // decr first element
                x_[++k_] = n_ - 1;  // add element
            }
            else
            {
                if ( x_[k_] == x_[k_-1]+1 )  --k_;  // remove last element
                else
                {
                    x_[k_]--;           // decr last element
                    x_[++k_] = n_ - 1;  // add element
                }
            }

            return k_ + 1;
        }

        const ulong * data()  const  { return x_; }
    };

About 270 million subsets per second are generated with next() and about 155 million with prev() [FXT: comb/subset-lex-demo.cc]. A generalization of this order with mixed radix numbers is described in section 9.3 on page 224. A bit-level algorithm is given in section 1.26 on page 70.


8.2  Minimal-change order


8.2.1  Generation as delta sets

The subsets of a set with 5 elements in minimal-change order are shown in figure 8.2-A. The implementation [FXT: class subset_gray_delta in comb/subset-gray-delta.h] uses the Gray code of binary words and updates the position corresponding to the bit that changes in the Gray code:

    class subset_gray_delta
    // Subsets of the set {0,1,2,...,n-1} in minimal-change (Gray code) order.
    {
    public:
        ulong *x_;   // current subset as delta-set
        ulong n_;    // number of elements in set <= BITS_PER_LONG
        ulong j_;    // position of last change
        ulong ct_;   // gray_code(ct_) corresponds to the current subset
        ulong mct_;  // max value of ct_

    public:
        subset_gray_delta(ulong n)
        {
            n_ = (n ? n : 1);  // not zero
            x_ = new ulong[n_];
            mct_ = (1UL<<n) - 1;
            first(0);
        }

        ~subset_gray_delta()  { delete [] x_; }

Figure 8.2-A: The subsets of the set {0, 1, 2, 3, 4} in minimal-change order (left) and complemented minimal-change order (right). The changes are on the same places for both orders.

In the initializer one can choose whether the first set is the empty or the full set (left and right of figure 8.2-A):

        void first(ulong v=0)
        {
            ct_ = 0;
            j_ = n_ - 1;
            for (ulong j=0; j<n_; ++j)  x_[j] = v;
        }

        const ulong * data()  const  { return x_; }
        ulong pos()  const  { return j_; }
        ulong current()  const  { return ct_; }

        ulong next()
        // Return position of change, return n with last subset
        {
            if ( ct_ == mct_ )  { return n_; }

            ++ct_;
            j_ = lowest_one_idx( ct_ );
            x_[j_] ^= 1;

            return j_;
        }

        ulong prev()
        // Return position of change, return n with first subset
        {
            if ( ct_ == 0 )  { return n_; }

            j_ = lowest_one_idx( ct_ );
            x_[j_] ^= 1;
            --ct_;

            return j_;
        }
    };

About 180 million subsets are generated per second [FXT: comb/subset-gray-delta-demo.cc].
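
The update rule (toggle the entry at the index of the lowest set bit of an incremented counter) can be exercised with a small self-contained sketch. The struct gray_delta_sketch and its members are hypothetical names, the lowest-bit index is computed with a plain loop rather than a library routine, and n is assumed to be at least 1 and smaller than the number of bits in an unsigned long.

```cpp
#include <cassert>
#include <vector>

using ulong = unsigned long;

// Hypothetical sketch of a Gray code subset generator on delta sets.
struct gray_delta_sketch
{
    std::vector<ulong> x;  // current subset as delta set
    ulong ct;              // counter; its Gray code is the current subset
    ulong mct;             // maximal value of the counter

    explicit gray_delta_sketch(ulong n) : x(n, 0), ct(0), mct((1UL << n) - 1) {}

    // Index of the lowest set bit of v, must have v != 0.
    static ulong lowest_one(ulong v)
    {
        assert( v != 0 );
        ulong i = 0;
        while ( (v & 1) == 0 )  { v >>= 1;  ++i; }
        return i;
    }

    // Returns position of change, or n with the last subset.
    ulong next()
    {
        if ( ct == mct )  return x.size();
        ++ct;
        const ulong j = lowest_one(ct);
        x[j] ^= 1;  // toggle one element
        return j;
    }
};
```

With n = 5 the first three updates change positions 0, 1, 0 and lead to the subsets {0}, {0, 1}, {1}; the last subset reached is {4}.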


8.2.2  Generation  as  sets 


A generator for the subsets of {1, 2, ..., n} in set representation is [FXT: class subset_gray in comb/subset-gray.h]:

    class subset_gray
    // Subsets of the set {1,2,...,n} in minimal-change (Gray code) order.
    {
    public:
        ulong *x_;  // data k-subset of {1,2,...,n} in x[1,...,k]
        ulong n_;   // subsets of n-set
        ulong k_;   // number of elements in subset

    public:
        subset_gray(ulong n)
        {
            n_ = n;
            x_ = new ulong[n_+1];
            x_[0] = 0;
            first();
        }

        ~subset_gray()  { delete [] x_; }

        ulong first()  { k_ = 0;  return k_; }
        ulong last()  { x_[1] = 1;  k_ = 1;  return k_; }
        const ulong * data()  const  { return x_+1; }
        const ulong num()  const  { return k_; }


The algorithm to compute the successor is described in section 1.16.3 on page 43, see also [192]:

    private:
        ulong next_even()
        {
            if ( x_[k_]==n_ )  // remove n (from end):
            {
                --k_;
            }
            else  // append n:
            {
                ++k_;
                x_[k_] = n_;
            }
            return k_;
        }

        ulong next_odd()
        {
            if ( x_[k_]-1==x_[k_-1] )  // remove x[k]-1 (from position k-1):
            {
                x_[k_-1] = x_[k_];
                --k_;
            }
            else  // insert x[k]-1 as second last element:
            {
                x_[k_+1] = x_[k_];
                --x_[k_];
                ++k_;
            }
            return k_;
        }

    public:
        ulong next()
        {
            if ( 0==(k_&1) )  return next_even();
            else  return next_odd();
        }

        ulong prev()
        {
            if ( 0==(k_&1) )  // k even
            {
                if ( 0==k_ )  return last();
                return next_odd();
            }
            else  return next_even();
        }
    };


About 241 million subsets per second are generated with next() and about 167 M/s with prev() [FXT:
comb/subset-gray-demo.cc]. With arrays instead of pointers the rates are about 266 M/s and 179 M/s.
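To see the successor rule at work, here is a small self-contained sketch that restates next_even() and next_odd() with a std::vector and records each subset as a bit mask. The struct name `gray_set_sketch`, the helper `gray_sets`, and the mask encoding (element e sets bit e-1) are assumptions made for this illustration only; the sketch assumes n >= 1 and makes a single pass over the subsets.

```cpp
#include <vector>

typedef unsigned long ulong;

// x[0] is a sentinel (zero), the current subset is in x[1..k].
struct gray_set_sketch
{
    std::vector<ulong> x;
    ulong n, k;

    explicit gray_set_sketch(ulong nn) : x(nn+2, 0), n(nn), k(0) { }

    ulong next()
    {
        if ( (k & 1)==0 )  // k even: toggle element n
        {
            if ( x[k]==n )  { --k; }            // remove n (from end)
            else  { ++k;  x[k] = n; }           // append n
        }
        else  // k odd: toggle element x[k]-1
        {
            if ( x[k]-1==x[k-1] )  { x[k-1] = x[k];  --k; }    // remove it
            else  { x[k+1] = x[k];  --x[k];  ++k; }            // insert it
        }
        return k;
    }

    // Current subset as a bit mask: element e sets bit e-1.
    ulong mask() const
    {
        ulong m = 0;
        for (ulong i=1; i<=k; ++i)  m |= 1UL << (x[i]-1);
        return m;
    }
};

// All 2**n subsets of {1,...,n} as masks, in the order visited.
static std::vector<ulong> gray_sets(ulong n)
{
    gray_set_sketch g(n);
    std::vector<ulong> out;
    out.push_back( g.mask() );
    const ulong N = 1UL << n;
    for (ulong i=1; i<N; ++i)  { g.next();  out.push_back( g.mask() ); }
    return out;
}
```

For n = 3 the subsets appear as {}, {3}, {2,3}, {2}, {1,2}, {1,2,3}, {1,3}, {1}.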

8.2.3  Computing  just  the  positions  of  change 

The following routine computes only the locations of the changes; it is given in [52]. It can also be
obtained as a specialization (for radix 2) of the loopless algorithm for computing a Gray code ordering
of mixed radix numbers given in section 9.2 on page 220 [FXT: class ruler_func in comb/ruler-func.h]:

class ruler_func
// Ruler function sequence: 0102010301020104010201 ...
{
public:
    ulong *f_;  // focus pointer
    ulong n_;

public:
    ruler_func(ulong n)
    {
        n_ = n;
        f_ = new ulong[n+2];
        first();
    }

    ~ruler_func()  { delete [] f_; }

    void first()  { for (ulong k=0; k<n_+2; ++k)  f_[k] = k; }

    ulong next()
    {
        const ulong j = f_[0];
//        if ( j==n_ )  { first();  return n_; }  // leave to user
        f_[0] = 0;
        const ulong nj = j+1;
        f_[j] = f_[nj];
        f_[nj] = nj;
        return j;
    }
};


The rate of generation is about 244 M/s and 293 M/s if an array is used [FXT: comb/ruler-func-demo.cc].
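As a cross-check of the focus-pointer idea, the sketch below computes the ruler sequence both with focus pointers and directly as the index of the lowest set bit of 1, 2, 3, .... The function names `ruler_focus` and `ruler_direct` are made up for this illustration; the focus-pointer update is the radix-2 case of the well-known loopless Gray code technique and is an assumption about what the class above does internally.

```cpp
#include <vector>

typedef unsigned long ulong;

// First `count` terms of the ruler sequence via focus pointers.
// Requires count < 2**n.
static std::vector<ulong> ruler_focus(ulong n, ulong count)
{
    std::vector<ulong> f(n+2);
    for (ulong k=0; k<n+2; ++k)  f[k] = k;
    std::vector<ulong> out;
    for (ulong i=0; i<count; ++i)
    {
        const ulong j = f[0];   // position of change
        f[0] = 0;
        const ulong nj = j + 1;
        f[j] = f[nj];
        f[nj] = nj;
        out.push_back(j);
    }
    return out;
}

// Reference: index of the lowest set bit of i, for i = 1, 2, ..., count.
static std::vector<ulong> ruler_direct(ulong count)
{
    std::vector<ulong> out;
    for (ulong i=1; i<=count; ++i)
    {
        ulong x = i, j = 0;
        while ( (x & 1UL)==0 )  { x >>= 1;  ++j; }
        out.push_back(j);
    }
    return out;
}
```

With n = 4 both routines give 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 for the first 15 terms.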


Figure 8.3-A: Subsets of a 5-element set in an order corresponding to a De Bruijn sequence (left), and
alternative ordering obtained by complementing the elements at even indices (right).


8.3  Ordering  with  De  Bruijn  sequences 


A curious ordering for all subsets of a given set can be generated using a binary De Bruijn sequence,
which is a cyclic sequence of zeros and ones that contains each n-bit word once. In figure 8.3-A the empty
places of the subsets are included to make the nice feature apparent [FXT: comb/subset-debruijn-demo.cc].
The ordering has the single track property: each column in this (delta set) representation is a circular
shift of the first column. Each subset is made from its predecessor by shifting it to the right and
inserting the current element from the sequence. The underlying De Bruijn sequence is

10001100101001110101101111100000 


The implementation [FXT: class subset_debruijn in comb/subset-debruijn.h] uses [FXT: class
binary_debruijn in comb/binary-debruijn.h], described in section 18.2 on page 377.


Successive subsets differ in many elements if the sequency (see section 1.17 on page 46) is large. Using
the 'sequency-complemented' subsets (see end of section 1.17), we obtain an ordering where more elements
change with small sequencies, as shown at the right of figure 8.3-A. This ordering corresponds to the
complement-shift sequence of section 20.2.3 on page 397.
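The shift-and-insert step can be sketched with the 32-element sequence quoted above. This is an illustration only: the function name `debruijn_subsets` is made up, the sequence is passed in as a string rather than generated, and the word is shifted left in the bit mask (whether this appears as a right or left shift on paper depends on how bits are mapped to elements).

```cpp
#include <string>
#include <vector>

typedef unsigned long ulong;

// Read the cyclic 0/1 string s and return the n-bit words obtained by
// shifting the current word by one position and inserting the next
// element of the sequence.
static std::vector<ulong> debruijn_subsets(const std::string &s, ulong n)
{
    const ulong len = (ulong)s.size();
    const ulong mask = (1UL << n) - 1;
    ulong w = 0;
    // preload with the last n-1 elements (the sequence is cyclic):
    for (ulong i=len-(n-1); i<len; ++i)  w = (w << 1) | (ulong)(s[i]-'0');
    std::vector<ulong> out;
    for (ulong i=0; i<len; ++i)
    {
        w = ( (w << 1) | (ulong)(s[i]-'0') ) & mask;
        out.push_back(w);
    }
    return out;
}
```

For `debruijn_subsets("10001100101001110101101111100000", 5)` each of the 32 possible 5-bit words, that is, each subset of a 5-element set, occurs exactly once.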


8.4  Shifts-order  for  subsets 


Figure 8.4-A shows an ordering (shifts-order) of the nonempty subsets of a 6-bit binary word where
all linear shifts of a word appear in succession. The generation is done by a simple recursion [FXT:
comb/shift-subsets-demo.cc]:


ulong n;  // number of bits
ulong N;  // 2**n

void A(ulong x)
{
    if ( x>=N )  return;
    visit(x);
    A(2*x);
    A(2*x+1);
}

Figure 8.4-A: Nonempty subsets of a 6-bit binary word where all linear shifts of a word appear in
succession (shifts-order). All shifts are left shifts.

Figure 8.4-B: Nonempty subsets of a 6-bit binary word where all linear shifts of a word appear in
succession and transitions that are not shifts switch just one bit (minimal-change shifts-order).

Figure 8.4-C: Nonzero Fibonacci words in an order where all shifts appear in succession.


The function visit() prints the binary expansion of its argument. The initial call is A(1).


The transitions that are not shifts change just one bit if the following pair of functions is used for the
recursion (minimal-change shifts-order shown in figure 8.4-B):


void F(ulong x)
{
    if ( x>=N )  return;
    visit(x);
    F(2*x);
    G(2*x+1);
}

void G(ulong x)
{
    if ( x>=N )  return;
    F(2*x+1);
    G(2*x);
    visit(x);
}

The initial call is F(1); the reversed order can be generated via G(1).
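The pair of recursions can be run stand-alone by collecting the visited words instead of printing them. The following is a minimal sketch; the struct name `shifts_order`, the helper `bit_count`, and the use of a vector in place of visit() are assumptions made for this illustration.

```cpp
#include <vector>

typedef unsigned long ulong;

struct shifts_order
{
    ulong N;                 // 2**n
    std::vector<ulong> out;  // words in the order visited

    explicit shifts_order(ulong n) : N(1UL << n)  { F(1); }

    void F(ulong x)
    {
        if ( x>=N )  return;
        out.push_back(x);    // visit before descending
        F(2*x);
        G(2*x+1);
    }

    void G(ulong x)
    {
        if ( x>=N )  return;
        F(2*x+1);
        G(2*x);
        out.push_back(x);    // visit after descending
    }
};

// Number of set bits of x.
static ulong bit_count(ulong x)
{
    ulong c = 0;
    while ( x )  { c += (x & 1UL);  x >>= 1; }
    return c;
}
```

For n = 3 the order is 1, 2, 4, 5, 7, 6, 3 (binary 001, 010, 100, 101, 111, 110, 011): every transition is either a shift by one position or a change of a single bit, which can be checked with `bit_count(a ^ b) == 1`.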

A simple variation can be used to generate the Fibonacci words in a shifts-order shown in figure 8.4-C.
With transitions that are not shifts more than one bit is changed in general. The function used is [FXT:
comb/shift-subsets-demo.cc]:

void B(ulong x)
{
    if ( x>=N )  return;
    visit(x);
    B(2*x);
    B(4*x+1);
}
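The recursion for the Fibonacci words can be checked in the same way. In this sketch the names `fib_rec` and `fibonacci_words_shifts_order` are made up and the visited words are stored in a vector; the expected count uses the standard fact that there are F(n+2) binary words of length n without two adjacent ones (including the zero word).

```cpp
#include <vector>

typedef unsigned long ulong;

// Append either a zero (2*x) or the two bits 01 (4*x+1) to the word x,
// so no two adjacent ones can ever arise.
static void fib_rec(ulong x, ulong N, std::vector<ulong> &out)
{
    if ( x>=N )  return;
    out.push_back(x);
    fib_rec(2*x, N, out);
    fib_rec(4*x+1, N, out);
}

// Nonzero n-bit Fibonacci words in shifts-order.
static std::vector<ulong> fibonacci_words_shifts_order(ulong n)
{
    std::vector<ulong> out;
    fib_rec(1, 1UL << n, out);
    return out;
}
```

For n = 3 this gives 1, 2, 4, 5; for n = 6 it gives the 20 nonzero Fibonacci words.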

A bit-level algorithm for combinations in shifts-order is given in section 1.24.3 on page 64.




8.5 k-subsets where k lies in a given range

We give algorithms for generating all k-subsets of the n-set where k lies in the range kmin <= k <= kmax.
If kmin = 0 and kmax = n, we generate all subsets. If kmin = kmax = k, we get the k-combinations of n.
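To make the range condition concrete, the following sketch lists all subsets of {0, 1, ..., n-1} whose size lies between kmin and kmax, in lexicographic order of the sorted element lists. It is a plain recursion written for this illustration (the names `collect_rec` and `ksubsets_lex` are hypothetical) and is not the FXT routine given below.

```cpp
#include <vector>

// Extend the current prefix `cur` in all possible ways.
static void collect_rec(int n, int kmin, int kmax, std::vector<int> &cur,
                        std::vector< std::vector<int> > &out)
{
    const int d = (int)cur.size();
    if ( d>=kmin )  out.push_back(cur);  // prefix is a valid subset
    if ( d==kmax )  return;
    const int x0 = ( d==0 ? 0 : cur.back()+1 );
    // elements still needed after this one to reach kmin:
    const int need = ( kmin>d+1 ? kmin-(d+1) : 0 );
    for (int x=x0; x+need<n; ++x)
    {
        cur.push_back(x);
        collect_rec(n, kmin, kmax, cur, out);
        cur.pop_back();
    }
}

// All k-subsets of {0,...,n-1} with kmin<=k<=kmax, lexicographic order.
static std::vector< std::vector<int> > ksubsets_lex(int n, int kmin, int kmax)
{
    std::vector< std::vector<int> > out;
    std::vector<int> cur;
    collect_rec(n, kmin, kmax, cur, out);
    return out;
}
```

The number of subsets produced is the sum of the binomial coefficients binomial(n,k) for k = kmin, ..., kmax, for example 15 + 20 = 35 for n = 6 and 2 <= k <= 3, and 10 + 10 + 5 = 25 for n = 5 and 2 <= k <= 4.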


8.5.1  Recursive  algorithm 


A generator for all k-subsets where k lies in a prescribed range is [FXT: class ksubset_rec in
comb/ksubset-rec.h]. The algorithm used can generate the subsets in 16 different orders. Figure 8.5-A
shows the lexicographic orders, figure 8.5-B shows three Gray codes. The constructor has just one
argument, the number of elements of the set whose subsets are generated:
class ksubset_rec
// k-subsets where kmin<=k<=kmax in various orders.
// Recursive CAT algorithm.
{
public:
    long n_;      // subsets of a n-element set
    long kmin_, kmax_;  // k-subsets where kmin<=k<=kmax
    long *rv_;    // record of visits in graph (list of elements in subset)
    ulong ct_;    // count subsets
    ulong rct_;   // count recursions (==work)
    ulong rq_;    // condition that determines the order
    ulong pq_;    // condition that determines the (printing) order
    ulong nq_;    // whether to reverse order
    // function to call with each combination:
    void (*visit_)(const ksubset_rec &, long);

public:
    ksubset_rec(ulong n)
    {
        n_ = n;
        rv_ = new long[n_+1];
        ++rv_;
        rv_[-1] = -1UL;
    }

    ~ksubset_rec()
    {
        --rv_;
        delete [] rv_;
    }


One has to supply the interval for k (variables kmin and kmax) and a function that will be called with
each subset. The argument rq determines which of the sixteen different orderings is chosen; the order
can be reversed with nonzero nq.

Figure 8.5-A: The k-subsets (where 2 <= k <= 3) of a 6-element set. Lexicographic order for sets (left)
and reversed lexicographic order for delta sets (right).
    void generate(void (*visit)(const ksubset_rec &, long),
                  long kmin, long kmax, ulong rq, ulong nq=0)
    {
        ct_ = 0;
        rct_ = 0;
        kmin_ = kmin;
        kmax_ = kmax;
        if ( kmin_ > kmax_ )  swap2(kmin_, kmax_);
        if ( kmax_ > n_ )  kmax_ = n_;
        if ( kmin_ > n_ )  kmin_ = n_;
        visit_ = visit;
        rq_ = rq % 4;
        pq_ = (rq>>2) % 4;
        nq_ = nq;
        next_rec(0);
    }

private:
    void next_rec(long d);
};


The recursive routine itself is given in [FXT: comb/ksubset-rec.cc]:

void
ksubset_rec::next_rec(long d)
{
    if ( d>kmax_ )  return;

    ++rct_;  // measure computational work
    long rv1 = rv_[d-1];  // left neighbor
    bool q;
    switch ( rq_ % 4 )
    {
    case 0:  q = 1;  break;
    case 1:  q = !(d&1);  break;
    case 2:  q = rv1&1;  break;
    case 3:  q = (d^rv1)&1;  break;
    }

    if ( nq_ )  q = !q;

    long x0 = rv1 + 1;
    long rx = n_ - (kmin_ - d);
    long x1 = min2( n_-1, rx );

#define PCOND(x) if ( (pq_==x) && (d>=kmin_) )  { visit_(*this, d);  ++ct_; }
    PCOND(0);
    if ( q )  // forward:
    {
        PCOND(1);
        for (long x=x0; x<=x1; ++x)  { rv_[d] = x;  next_rec(d+1); }
        PCOND(2);
    }
    else  // backward:
    {
        PCOND(2);
        for (long x=x1; x>=x0; --x)  { rv_[d] = x;  next_rec(d+1); }
        PCOND(1);
    }
    PCOND(3);
#undef PCOND
}

Figure 8.5-B: Three minimal-change orders (orders #6, #7, and #10) of the k-subsets (where 2 <= k <= 3)
of a 6-element set.

Figure 8.5-C: With kmin = 0 and order number seven, at each transition either one element is added or
removed, or one element moves to an adjacent position.

About 50 million subsets per second are generated [FXT: comb/ksubset-rec-demo.cc].

8.5.2 Iterative algorithm for a minimal-change order


       delta set   diff    set
   1:    ...11             { 4, 5 }
   2:    ..11.     ..P.M   { 3, 4 }
   3:    ..111     ....P   { 3, 4, 5 }
   4:    ..1.1     ...M.   { 3, 5 }
   5:    .11..     .P..M   { 2, 3 }
   6:    .11.1     ....P   { 2, 3, 5 }
   7:    .1111     ...P.   { 2, 3, 4, 5 }
   8:    .111.     ....M   { 2, 3, 4 }
   9:    .1.1.     ..M..   { 2, 4 }
  10:    .1.11     ....P   { 2, 4, 5 }
  11:    .1..1     ...M.   { 2, 5 }
  12:    11...     P...M   { 1, 2 }
  13:    11..1     ....P   { 1, 2, 5 }
  14:    11.11     ...P.   { 1, 2, 4, 5 }
  15:    11.1.     ....M   { 1, 2, 4 }
  16:    1111.     ..P..   { 1, 2, 3, 4 }
  17:    111.1     ...MP   { 1, 2, 3, 5 }
  18:    111..     ....M   { 1, 2, 3 }
  19:    1.1..     .M...   { 1, 3 }
  20:    1.1.1     ....P   { 1, 3, 5 }
  21:    1.111     ...P.   { 1, 3, 4, 5 }
  22:    1.11.     ....M   { 1, 3, 4 }
  23:    1..1.     ..M..   { 1, 4 }
  24:    1..11     ....P   { 1, 4, 5 }
  25:    1...1     ...M.   { 1, 5 }

Figure 8.5-D: The (25) k-subsets where 2 <= k <= 4 of a 5-element set in a minimal-change order.


A generator for subsets in Gray code order is [FXT: class ksubset_gray in comb/ksubset-gray.h]:

class ksubset_gray
{
public:
    ulong n_;   // k-subsets of {1, 2, ..., n}
    ulong kmin_, kmax_;  // kmin <= k <= kmax
    ulong k_;   // k elements in current set
    ulong *S_;  // set in S[1,2,...,k] with elements \in {1,2,...,n}
    ulong j_;   // aux

public:
    ksubset_gray(ulong n, ulong kmin, ulong kmax)
    {
        n_ = (n>0 ? n : 1);
        // Must have 1<=kmin<=kmax<=n
        kmin_ = kmin;
        kmax_ = kmax;
        if ( kmax_ < kmin_ )  swap2(kmin_, kmax_);
        if ( kmin_==0 )  kmin_ = 1;

        S_ = new ulong[kmax_+1];
        S_[0] = 0;  // sentinel: != 1
        first();
    }

    ~ksubset_gray()  { delete [] S_; }
    const ulong *data()  const  { return S_+1; }
    ulong num()  const  { return k_; }

    ulong last()
    {
        S_[1] = 1;  k_ = kmin_;
        if ( kmin_==1 )  { j_ = 1; }
        else
        {
            for (ulong i=2; i<=kmin_; ++i)  { S_[i] = n_ - kmin_ + i; }
            j_ = 2;
        }
        return k_;
    }

    ulong first()
    {
        k_ = kmin_;
        for (ulong i=1; i<=kmin_; ++i)  { S_[i] = n_ - kmin_ + i; }
        j_ = 1;
        return k_;
    }

    bool is_first()  const  { return ( S_[1] == n_ - kmin_ + 1 );  }

    bool is_last()  const
    {
        if ( S_[1] != 1 )  return 0;
        if ( kmin_<=1 )  return (k_==1);
        return (S_[2]==n_-kmin_+2);
    }
  [--snip--]


The routines for computing the next or previous subset are adapted from a routine to compute the
successor given in [192]. It is split into two auxiliary functions:


private:
    void prev_even()
    {
        ulong &n=n_, &kmin=kmin_, &kmax=kmax_, &j=j_;
        if ( S_[j-1] == S_[j]-1 )  // can touch sentinel S[0]
        {
            S_[j-1] = S_[j];
            if ( j > kmin )
            {
                if ( S_[kmin] == n )  { j = j-2; }  else  { j = j-1; }
            }
            else
            {
                S_[j] = n - kmin + j;
                if ( S_[j-1]==S_[j]-1 )  { j = j-2; }
            }
        }
        else
        {
            S_[j] = S_[j] - 1;
            if ( j < kmax )
            {
                S_[j+1] = S_[j] + 1;
                if ( j >= kmin-1 )  { j = j+1; }  else  { j = j+2; }
            }
        }
    }

    void prev_odd()
    {
        ulong &n=n_, &kmin=kmin_, &kmax=kmax_, &j=j_;
        if ( S_[j] == n )  { j = j-1; }
        else
        {
            if ( j < kmax )
            {
                S_[j+1] = n;
                j = j+1;
            }
            else
            {
                S_[j] = S_[j]+1;
                if ( S_[kmin]==n )  { j = j-1; }
            }
        }
    }
  [--snip--]

The next() and prev() functions use these routines.

Note  that  i 

    ulong prev()
    {
        if ( is_first() )  { last();  return 0; }
        if ( j_&1 )  prev_odd();
        else         prev_even();
        if ( j_<kmin_ )  { k_ = kmin_; }  else  { k_ = j_; }
        return k_;
    }

    ulong next()
    {
        if ( is_last() )  { first();  return 0; }
        if ( j_&1 )  prev_even();
        else         prev_odd();
        if ( j_<kmin_ )  { k_ = kmin_; }  else  { k_ = j_; }
        return k_;
    }
  [--snip--]


Usage of the class is shown in the program [FXT: comb/ksubset-gray-demo.cc]. The k-subsets where
2 <= k <= 4 in the order generated by the algorithm are shown in figure 8.5-D. About 150 million subsets
per second can be generated with the routine next() and 130 million with prev().
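For experimenting with the successor computation, the listing below restates the two auxiliary routines and next() in a self-contained form. It is a sketch under stated assumptions: the struct name `ksubset_gray_sketch` and the helper `all_ksubsets_gray` are made up, a std::vector replaces the raw array, the condition 1 <= kmin <= kmax <= n is assumed and not checked, and next() simply returns false at the last subset instead of resetting to the first one.

```cpp
#include <vector>

typedef unsigned long ulong;

struct ksubset_gray_sketch
{
    ulong n, kmin, kmax, k, j;
    std::vector<ulong> S;  // S[0] is a sentinel, the set is S[1..k]

    ksubset_gray_sketch(ulong n_, ulong kmin_, ulong kmax_)
        : n(n_), kmin(kmin_), kmax(kmax_), k(kmin_), j(1), S(kmax_+1, 0)
    {
        for (ulong i=1; i<=kmin; ++i)  S[i] = n - kmin + i;  // first subset
    }

    bool is_last() const
    {
        if ( S[1]!=1 )  return false;
        if ( kmin<=1 )  return (k==1);
        return ( S[2]==n-kmin+2 );
    }

    void prev_even()
    {
        if ( S[j-1]==S[j]-1 )  // can touch sentinel S[0]
        {
            S[j-1] = S[j];
            if ( j>kmin )  { if ( S[kmin]==n )  j = j-2;  else  j = j-1; }
            else
            {
                S[j] = n - kmin + j;
                if ( S[j-1]==S[j]-1 )  j = j-2;
            }
        }
        else
        {
            S[j] = S[j] - 1;
            if ( j<kmax )
            {
                S[j+1] = S[j] + 1;
                if ( j>=kmin-1 )  j = j+1;  else  j = j+2;
            }
        }
    }

    void prev_odd()
    {
        if ( S[j]==n )  { j = j-1; }
        else if ( j<kmax )  { S[j+1] = n;  j = j+1; }
        else
        {
            S[j] = S[j] + 1;
            if ( S[kmin]==n )  j = j-1;
        }
    }

    // Advance to the next subset; false if the current one is the last.
    bool next()
    {
        if ( is_last() )  return false;
        if ( j & 1 )  prev_even();  else  prev_odd();
        k = ( j<kmin ? kmin : j );
        return true;
    }

    std::vector<ulong> current() const
    { return std::vector<ulong>( S.begin()+1, S.begin()+1+k ); }
};

// All subsets in the order generated.
static std::vector< std::vector<ulong> >
all_ksubsets_gray(ulong n, ulong kmin, ulong kmax)
{
    ksubset_gray_sketch g(n, kmin, kmax);
    std::vector< std::vector<ulong> > out;
    do { out.push_back( g.current() ); }  while ( g.next() );
    return out;
}
```

With n = 5, kmin = 2 and kmax = 4 the sketch starts with {4,5}, {3,4}, {3,4,5}, {3,5} and ends with {1,5} after 25 subsets, in agreement with figure 8.5-D.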


8.5.3  A two-close  order  with  homogenous  moves 

Orderings of the k-subsets with k in a given range that are two-close are shown in figure 8.5-E: one
element is inserted or removed or moves by at most two positions. The moves by two positions only
cross a zero, the changes are homogenous. The list was produced with the program [FXT:
comb/ksubset-twoclose-demo.cc] which uses [FXT: class ksubset_twoclose in comb/ksubset-twoclose.h]:

class ksubset_twoclose
// k-subsets (kmin<=k<=kmax) in a two-close order.
// Recursive algorithm.
{
public:
    ulong *rv_;  // record of visits in graph (delta set)
    ulong n_;    // subsets of the n-element set

    // function to call with each combination:
    void (*visit_)(const ksubset_twoclose &);
  [--snip--]

    void generate(void (*visit)(const ksubset_twoclose &),
                  ulong kmin, ulong kmax)
    {
        visit_ = visit;
        ulong kmax0 = n_ - kmin;
        next_rec(n_, kmax, kmax0, 0);
    }

Figure 8.5-E: The k-subsets where 2 <= k <= 4 of 5 elements (left) and the sets where 1 <= k <= 2 of 6
elements (right) in two-close orders.


The recursion is:

private:
    void next_rec(ulong d, ulong n1, ulong n0, bool q)
    // d:  remaining depth in recursion
    // n1: remaining ones to fill in
    // n0: remaining zeros to fill in
    // q:  direction in recursion
    {
        if ( 0==d )  { visit_(*this);  return; }

        --d;

        if ( q )
        {
            if ( n0 )  { rv_[d]=0;  next_rec(d, n1-0, n0-1, d&1); }
            if ( n1 )  { rv_[d]=1;  next_rec(d, n1-1, n0-0, q); }
        }
        else
        {
            if ( n1 )  { rv_[d]=1;  next_rec(d, n1-1, n0-0, q); }
            if ( n0 )  { rv_[d]=0;  next_rec(d, n1-0, n0-1, d&1); }
        }
    }
};

About 75 million subsets per second can be generated. For kmin = kmax =: k we obtain the enup order
for combinations described in section 6.6.2 on page 188.
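The recursion can be tried without the class by storing each visited delta set as a bit mask. This is a minimal sketch: the names `twoclose_sketch`, `popcount_ul` and `is_two_close`, and the encoding (bit i of the mask holds rv[i]), are assumptions made for this illustration.

```cpp
#include <vector>

typedef unsigned long ulong;

struct twoclose_sketch
{
    ulong n;
    std::vector<ulong> rv;   // delta set
    std::vector<ulong> out;  // visited delta sets as masks (bit i = rv[i])

    twoclose_sketch(ulong n_, ulong kmin, ulong kmax) : n(n_), rv(n_, 0)
    { next_rec(n, kmax, n - kmin, false); }

    void next_rec(ulong d, ulong n1, ulong n0, bool q)
    {
        if ( d==0 )  // word complete: record it
        {
            ulong m = 0;
            for (ulong i=0; i<n; ++i)  m |= rv[i] << i;
            out.push_back(m);
            return;
        }
        --d;
        if ( q )
        {
            if ( n0 )  { rv[d] = 0;  next_rec(d, n1, n0-1, (d&1)!=0); }
            if ( n1 )  { rv[d] = 1;  next_rec(d, n1-1, n0, q); }
        }
        else
        {
            if ( n1 )  { rv[d] = 1;  next_rec(d, n1-1, n0, q); }
            if ( n0 )  { rv[d] = 0;  next_rec(d, n1, n0-1, (d&1)!=0); }
        }
    }
};

// Number of set bits of x.
static ulong popcount_ul(ulong x)
{
    ulong c = 0;
    while ( x )  { c += (x & 1UL);  x >>= 1; }
    return c;
}

// True if b arises from a by inserting or removing one element,
// or by moving one element by at most two positions.
static bool is_two_close(ulong a, ulong b)
{
    ulong d = a ^ b;
    if ( popcount_ul(d)==1 )  return true;
    if ( popcount_ul(d)!=2 )  return false;
    if ( popcount_ul(a)!=popcount_ul(b) )  return false;
    while ( (d & 1UL)==0 )  d >>= 1;  // lowest changed bit to position 0
    return ( d==3 || d==5 );          // distance one or two
}
```

For n = 4 and kmin = kmax = 2 the masks come out as 12, 10, 9, 3, 5, 6 (delta sets 0011, 0101, 1001, 1100, 1010, 0110 with position 0 on the left); the step from 1001 to 1100 moves an element by two positions across a zero.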
Chapter 9


Mixed radix numbers


The mixed radix representation $A = [a_0, a_1, a_2, \ldots, a_{n-1}]$ of a number $x$ with respect to a radix
vector $M = [m_0, m_1, m_2, \ldots, m_{n-1}]$ is given by

$$x = \sum_{k=0}^{n-1} a_k \prod_{j=0}^{k-1} m_j \qquad (9.0\text{-}1)$$

where $0 \le a_j < m_j$ (and $0 \le x < \prod_{j=0}^{n-1} m_j$, so that $n$ digits suffice). For $M = [r, r, r, \ldots, r]$
the relation reduces to the radix-$r$ representation:

$$x = \sum_{k=0}^{n-1} a_k \, r^k \qquad (9.0\text{-}2)$$

All 3-digit radix-4 numbers are shown in various orders in figure 9.0-A. Note that the least significant
digit ($a_0$) is at the left side of each number (array representation).
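Relation (9.0-1) translates directly into a pair of conversion routines. The following is a minimal sketch for illustration; the function names `to_mixed_radix` and `from_mixed_radix` are made up and all radices are assumed to be at least 1.

```cpp
#include <vector>

typedef unsigned long ulong;

// Digits a[0..n-1] of x with respect to the radix vector m,
// least significant digit first (array representation).
static std::vector<ulong> to_mixed_radix(ulong x, const std::vector<ulong> &m)
{
    std::vector<ulong> a(m.size());
    for (unsigned k=0; k<m.size(); ++k)  { a[k] = x % m[k];  x /= m[k]; }
    return a;
}

// Evaluate relation (9.0-1): x = sum_k a[k] * m[0]*m[1]*...*m[k-1].
static ulong from_mixed_radix(const std::vector<ulong> &a,
                              const std::vector<ulong> &m)
{
    ulong x = 0, w = 1;  // w is the product m[0]*...*m[k-1]
    for (unsigned k=0; k<a.size(); ++k)  { x += a[k]*w;  w *= m[k]; }
    return x;
}
```

With M = [2, 3, 4] the number 5 has the digits [1, 2, 0] and the largest representable number, 23, has the digits [1, 2, 3].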

9.1  Counting  (lexicographic)  order 

An implementation for mixed radix counting is [FXT: class mixedradix_lex in comb/mixedradix-lex.h]:

class mixedradix_lex
{
public:
    ulong *a_;   // digits
    ulong *m1_;  // radix (minus one) for each digit
    ulong n_;    // Number of digits
    ulong j_;    // position of last change

public:
    mixedradix_lex(const ulong *m, ulong n, ulong mm=0)
    {
        n_ = n;
        a_ = new ulong[n_+1];
        m1_ = new ulong[n_+1];
        a_[n_] = 1;   // sentinel: !=0, and !=m1[n]
        m1_[n_] = 0;  // sentinel
        mixedradix_init(n_, mm, m, m1_);
        first();
    }
  [--snip--]

The initialization routine mixedradix_init() is given in [FXT: comb/mixedradix-init.cc]:

void
mixedradix_init(ulong n, ulong mm, const ulong *m, ulong *m1)
// Auxiliary function used to initialize vector of nines in mixed radix classes.
{
    if ( m )  // all radices given
    {
        for (ulong k=0; k<n; ++k)  m1[k] = m[k] - 1;
    }
    else


218 


Chapter  9:  Mixed  radix  numbers 


counting 

Gray 

modular  Gray 

gslex 

endo 

endo  Gray 

0 

[ . 

] 

[ . 

. ] 

[ 

1 

[ 1 

. 1 

[ . 

. ] 

[ 

1 

1 

[ 1 

] 

[ 1 

. 1 

[ 

1 

1 

[ 2 

. ] 

[ 1 

. ] 

[ 

1 

1 

2 

[ 2 

] 

[ 2 

. 1 

[ 

2 

] 

[ 3 

. ] 

[ 3 

. ] 

[ 

3 

] 

3 

[ 3 

] 

[ 3 

. 1 

[ 

3 

] 

[ 1 

1 

. ] 

[ 2 

. ] 

[ 

2 

] 

4 

[ . 

1 

] 

[ 3 

1 

. 1 

[ 

3 

1 

1 

[ 2 

1 

. ] 

[ • 

1 

. ] 

[ 

2 

1 

1 

5 

[ 1 

1 

1 

[ 2 

1 

. 1 

[ 

1 

] 

[ 3 

1 

. ] 

[ 1 

1 

. ] 

[ 

3 

1 

1 

6 

[ 2 

1 

1 

[ 1 

1 

. 1 

[ 

i 

1 

] 

[ . 

1 

. 1 

[ 3 

1 

. ] 

[ 

1 

1 

1 

7 

[ 3 

1 

] 

[ . 

1 

. 1 

[ 

2 

1 

] 

[ 1 

2 

. ] 

[ 2 

1 

. ] 

[ 

1 

] 

8 

[ . 

2 

] 

[ . 

2 

. 1 

[ 

2 

2 

] 

[ 2 

2 

. ] 

[ . 

3 

. ] 

[ 

3 

1 

9 

[ 1 

2 

] 

[ 1 

2 

. ] 

[ 

3 

2 

1 

[ 3 

2 

. ] 

[ 1 

3 

. ] 

[ 

i 

3 

] 

10 

[ 2 

2 

] 

[ 2 

2 

. 1 

[ 

2 

1 

[ . 

2 

. ] 

[ 3 

3 

. ] 

[ 

3 

3 

1 

11 

[ 3 

2 

] 

[ 3 

2 

. ] 

[ 

i 

2 

] 

[ 1 

3 

. ] 

[ 2 

3 

. ] 

[ 

2 

3 

] 

12 

[ . 

3 

] 

[ 3 

3 

. ] 

[ 

1 

3 

] 

[ 2 

3 

. ] 

[ . 

2 

. ] 

[ 

2 

2 

] 

13 

[ 1 

3 

] 

[ 2 

3 

. 1 

[ 

2 

3 

] 

[ 3 

3 

. ] 

[ 1 

2 

. ] 

[ 

3 

2 

1 

14 

[ 2 

3 

] 

[ 1 

3 

. 1 

[ 

3 

3 

] 

[ . 

3 

. ] 

[ 3 

2 

. ] 

[ 

1 

2 

1 

15 

[ 3 

3 

1 

[ . 

3 

. 1 

[ 

3 

1 

[ 1 

1 1 

[ 2 

2 

. ] 

[ 

2 

1 

16 

[ . 

1 

] 

[ . 

3 

1 ] 

[ 

3 

1 

] 

[ 2 

1 ] 

[ . 

1 ] 

[ 

2 

1 

] 

17 

[ 1 

1 

] 

[ 1 

3 

1 ] 

[ 

i 

3 

1 

] 

[ 3 

1 1 

[ 1 

1 ] 

[ 

i 

2 

1 

1 

18 

[ 2 

1 

] 

[ 2 

3 

1 ] 

[ 

2 

3 

1 

1 

[ 1 

i 

1 ] 

[ 3 

1 ] 

[ 

3 

2 

1 

1 

19 

[ 3 

1 

] 

[ 3 

3 

1 ] 

[ 

3 

3 

1 

] 

[ 2 

1 

1 ] 

[ 2 

1 ] 

[ 

2 

2 

1 

] 

20 

[ . 

i 

1 

] 

[ 3 

2 

1 ] 

[ 

3 

1 

] 

[ 3 

1 

1 ] 

[ . 

i 

1 ] 

[ 

2 

3 

1 

] 

21 

[ 1 

1 

1 

] 

[ 2 

2 

1 ] 

[ 

1 

] 

[ . 

1 

1 ] 

[ 1 

1 

1 ] 

[ 

3 

3 

1 

1 

22 

[ 2 

1 

1 

] 

[ 1 

2 

1 ] 

[ 

i 

1 

] 

[ 1 

2 

1 1 

[ 3 

1 

1 ] 

[ 

1 

3 

1 

] 

23 

[ 3 

1 

1 

] 

[ . 

2 

1 ] 

[ 

2 

1 

] 

[ 2 

2 

1 ] 

[ 2 

1 

1 ] 

[ 

3 

1 

] 

24 

[ . 

2 

1 

1 

[ . 

1 

1 ] 

[ 

2 

i 

1 

] 

[ 3 

2 

1 ] 

[ • 

3 

1 ] 

[ 

1 

1 

1 

25 

[ 1 

2 

1 

] 

[ 1 

1 

1 ] 

[ 

3 

1 

1 

1 

[ . 

2 

1 1 

[ 1 

3 

1 ] 

[ 

i 

1 

1 

] 

26 

[ 2 

2 

1 

] 

[ 2 

1 

1 ] 

[ 

1 

1 

1 

[ 1 

3 

1 ] 

[ 3 

3 

1 ] 

[ 

3 

1 

1 

1 

27 

[ 3 

2 

1 

] 

[ 3 

1 

1 ] 

[ 

i 

1 

1 

1 

[ 2 

3 

1 ] 

[ 2 

3 

1 ] 

[ 

2 

1 

1 

] 

28 

[ . 

3 

1 

1 

[ 3 

1 ] 

[ 

1 

2 

1 

1 

[ 3 

3 

1 ] 

[ . 

2 

1 ] 

[ 

2 

1 

1 

29 

[ 1 

3 

1 

] 

[ 2 

1 ] 

[ 

2 

2 

1 

1 

[ . 

3 

1 ] 

[ 1 

2 

1 ] 

[ 

3 

1 

1 

30 

[ 2 

3 

1 

] 

[ 1 

1 ] 

[ 

3 

2 

1 

] 

[ . 

1 ] 

[ 3 

2 

1 ] 

[ 

1 

1 

1 

31 

[ 3 

3 

1 

] 

[ . 

1 ] 

[ 

2 

1 

] 

[ 1 

2 ] 

[ 2 

2 

1 ] 

[ 

1 

] 

32 

[ . 

2 

] 

[ . 

2 ] 

[ 

2 

2 

] 

[ 2 

2 ] 

[ • 

3 ] 

[ 

3 

1 

33 

[ 1 

2 

1 

[ 1 

2 ] 

[ 

i 

2 

2 

] 

[ 3 

2 ] 

[ 1 

3 ] 

[ 

i 

3 

1 

34 

[ 2 

2 

] 

[ 2 

2 ] 

[ 

2 

2 

2 

1 

[ 1 

i 

2 ] 

[ 3 

3 ] 

[ 

3 

3 

] 

35 

[ 3 

2 

] 

[ 3 

2 ] 

[ 

3 

2 

2 

] 

[ 2 

1 

2 ] 

[ 2 

3 ] 

[ 

2 

3 

1 

36 

[ . 

i 

2 

] 

[ 3 

i 

2 ] 

[ 

3 

3 

2 

1 

[ 3 

1 

2 ] 

[ • 

i 

3 ] 

[ 

2 

i 

3 

1 

37 

[ 1 

1 

2 

] 

[ 2 

1 

2 ] 

[ 

3 

2 

] 

[ . 

1 

2 ] 

[ 1 

1 

3 ] 

[ 

3 

1 

3 

1 

38 

[ 2 

1 

2 

] 

Figure 9.0-A: All 3-digit, radix-4 numbers in various orders (dots denote zeros): counting, Gray, modular Gray, gslex, endo, and endo Gray order. The least significant digit is on the left of each word
(array notation).

            M=[ 2 3 4 ]    M=[ 4 3 2 ]
       0:   [ . . . ]      [ . . . ]
       1:   [ 1 . . ]      [ 1 . . ]
       2:   [ . 1 . ]      [ 2 . . ]
       3:   [ 1 1 . ]      [ 3 . . ]
       4:   [ . 2 . ]      [ . 1 . ]
       5:   [ 1 2 . ]      [ 1 1 . ]
       6:   [ . . 1 ]      [ 2 1 . ]
       7:   [ 1 . 1 ]      [ 3 1 . ]
       8:   [ . 1 1 ]      [ . 2 . ]
       9:   [ 1 1 1 ]      [ 1 2 . ]
      10:   [ . 2 1 ]      [ 2 2 . ]
      11:   [ 1 2 1 ]      [ 3 2 . ]
      12:   [ . . 2 ]      [ . . 1 ]
      13:   [ 1 . 2 ]      [ 1 . 1 ]
      14:   [ . 1 2 ]      [ 2 . 1 ]
      15:   [ 1 1 2 ]      [ 3 . 1 ]
      16:   [ . 2 2 ]      [ . 1 1 ]
      17:   [ 1 2 2 ]      [ 1 1 1 ]
      18:   [ . . 3 ]      [ 2 1 1 ]
      19:   [ 1 . 3 ]      [ 3 1 1 ]
      20:   [ . 1 3 ]      [ . 2 1 ]
      21:   [ 1 1 3 ]      [ 1 2 1 ]
      22:   [ . 2 3 ]      [ 2 2 1 ]
      23:   [ 1 2 3 ]      [ 3 2 1 ]

Figure 9.1-A: Mixed radix numbers in counting (lexicographic) order, dots denote zeros. The radix vectors are M = [2,3,4] (rising factorial base, left) and M = [4,3,2] (falling factorial base, right). The least significant digit is on the left of each word (array notation).


        {
            if ( mm>1 )  // use mm as radix for all digits
                for (ulong k=0; k<n; ++k)  ml[k] = mm - 1;
            else
            {
                if ( mm==0 )  // falling factorial base
                    for (ulong k=0; k<n; ++k)  ml[k] = n - k;
                else  // rising factorial base
                    for (ulong k=0; k<n; ++k)  ml[k] = k + 1;
            }
        }
    }


Instead of the vector of radices M = [m_0, m_1, m_2, ..., m_{n-1}] the vector of 'nines' (M' = [m_0 - 1, m_1 - 1, m_2 - 1, ..., m_{n-1} - 1], variable ml_) is used. This modification leads to slightly faster generation. The first n-digit number in lexicographic order is all-zero, the last is all-nines:

    [--snip--]
    void first()
    {
        for (ulong k=0; k<n_; ++k)  a_[k] = 0;
        j_ = n_;
    }

    void last()
    {
        for (ulong k=0; k<n_; ++k)  a_[k] = ml_[k];
        j_ = n_;
    }
    [--snip--]

A number is incremented by setting all nines (digits a_j that are equal to m_j - 1) at the lower end to zero and incrementing the next digit:

    bool next()  // increment
    {
        ulong j = 0;
        while ( a_[j]==ml_[j] )  { a_[j]=0; ++j; }  // can touch sentinels
        j_ = j;

        if ( j==n_ )  return false;  // current is last

        ++a_[j];
        return true;
    }
    [--snip--]


A number is decremented by setting all zero digits at the lower end to nine and decrementing the next digit:

    bool prev()  // decrement
    {
        ulong j = 0;
        while ( a_[j]==0 )  { a_[j]=ml_[j]; ++j; }  // can touch sentinels
        j_ = j;

        if ( j==n_ )  return false;  // current is first

        --a_[j];
        return true;
    }
    [--snip--]
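The increment and decrement described above can be tried in isolation. The following is a minimal sketch, not the FXT class: the name `MixedRadixCounter`, its members, and the helper `count_words()` are hypothetical, and `std::vector` replaces the raw arrays. It assumes every radix is at least 2 and uses one sentinel digit so that both scans stop without an explicit bound check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of FXT): mixed radix counter that stores
// the 'nines' (radix minus one) and one sentinel digit beyond the end.
struct MixedRadixCounter
{
    std::vector<unsigned long> a;      // digits a[0..n-1], sentinel a[n]
    std::vector<unsigned long> nines;  // nines[0..n-1], sentinel nines[n]
    std::size_t n;

    explicit MixedRadixCounter(const std::vector<unsigned long> &radices)
        : a(radices.size() + 1, 0), nines(radices.size() + 1, 0), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
        a[n] = 1;  // sentinel: neither zero nor equal to nines[n]==0
    }

    bool next()  // increment, false if the current word is the last
    {
        std::size_t j = 0;
        while ( a[j] == nines[j] )  { a[j] = 0;  ++j; }  // reset nines at the lower end
        if ( j == n )  return false;
        ++a[j];
        return true;
    }

    bool prev()  // decrement, false if the current word is the first
    {
        std::size_t j = 0;
        while ( a[j] == 0 )  { a[j] = nines[j];  ++j; }  // zeros at the lower end become nines
        if ( j == n )  return false;
        --a[j];
        return true;
    }
};

// Number of words visited by repeated increments, starting at all-zero.
inline unsigned long count_words(const std::vector<unsigned long> &radices)
{
    MixedRadixCounter c(radices);
    unsigned long ct = 1;
    while ( c.next() )  ++ct;
    return ct;
}
```

With radix vector [2,3,4] the counter visits 2*3*4 = 24 words. As with the listings above, a failed `next()` leaves the all-zero word behind and a failed `prev()` the all-nines word.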


Figure 9.1-A shows the 3-digit mixed radix numbers for bases M = [2,3,4] (left) and M = [4,3,2] (right). The listings were created with the program [FXT: comb/mixedradix-lex-demo.cc].

The rate of generation for the routine next() is about 166 M/s (with radix-2 numbers, M = [2,2,2,...,2]), 257 M/s (radix-3), and about 370 M/s (radix-8). The slowest generation occurs for radix-2, as the number of carries is maximal. A carry out of digit k occurs exactly when the digits 0, 1, ..., k are all nines, which is the case for one in m_0 m_1 ... m_k words. The number C of carries with incrementing is therefore on average

    C = \sum_{k=0}^{n-1} \prod_{j=0}^{k} \frac{1}{m_j}    (9.1-1)

The number of digits changed on average equals C + 1. For M = [r,r,r,...,r] (and n = ∞) we have C = 1/(r-1). For the worst case (r = 2) we have C = 1, so two digits are changed on average.
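The average number of carries can be checked by counting. The sketch below assumes that the average is the sum over k of 1/(m_0 m_1 ... m_k), which reproduces the value C = 1 quoted above for radix 2. Both function names are hypothetical and the counter is a plain bounds-checked loop, not the FXT routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average number of carries per increment, assuming the sum
// C = sum_k 1/(m_0*m_1*...*m_k) over all digits of a finite radix vector.
inline double average_carries(const std::vector<unsigned long> &radices)
{
    double c = 0.0, p = 1.0;
    for (unsigned long m : radices)  { p /= double(m);  c += p; }
    return c;
}

// Count the digit resets (carries) over one full cycle of increments,
// including the final overflow back to the all-zero word.
inline unsigned long count_carries(const std::vector<unsigned long> &radices)
{
    const std::size_t n = radices.size();
    std::vector<unsigned long> a(n, 0);
    unsigned long carries = 0;
    while ( true )
    {
        std::size_t j = 0;
        while ( j < n && a[j] == radices[j] - 1 )  { a[j] = 0;  ++j;  ++carries; }
        if ( j == n )  return carries;  // overflow: cycle complete
        ++a[j];
    }
}
```

For M = [2,3,4] the 24 increments of a full cycle cause 12 + 4 + 1 = 17 carries, in agreement with 24 * (1/2 + 1/6 + 1/24) = 17.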


9.2  Minimal-change  (Gray  code)  order 


9.2.1  Constant  amortized  time  (CAT)  algorithm 


Figure 9.2-A shows the 3-digit mixed radix numbers for radix vectors M = [2,3,4] (left) and M = [4,3,2] (right) in Gray code order. A generator for the Gray code order is [FXT: class mixedradix_gray in comb/mixedradix-gray.h]:

    class mixedradix_gray
    {
    public:
        ulong *a_;   // mixed radix digits
        ulong *ml_;  // radices (minus one)
        ulong *i_;   // direction
        ulong n_;    // n_ digits
        ulong j_;    // position of last change
        int dm_;     // direction of last move

    public:
        mixedradix_gray(const ulong *m, ulong n, ulong mm=0)
        {
            n_ = n;
            a_ = new ulong[n_+1];
            a_[n] = -1UL;  // sentinel
            i_ = new ulong[n_+1];
            i_[n_] = 0;  // sentinel
            ml_ = new ulong[n_+1];

            mixedradix_init(n_, mm, m, ml_);

            first();
        }

    [--snip--]

            M=[ 2 3 4 ]    x   j   d      M=[ 4 3 2 ]    x   j   d
       0:   [ . . . ]      0              [ . . . ]      0
       1:   [ 1 . . ]      1   0   1      [ 1 . . ]      1   0   1
       2:   [ 1 1 . ]      3   1   1      [ 2 . . ]      2   0   1
       3:   [ . 1 . ]      2   0  -1      [ 3 . . ]      3   0   1
       4:   [ . 2 . ]      4   1   1      [ 3 1 . ]      7   1   1
       5:   [ 1 2 . ]      5   0   1      [ 2 1 . ]      6   0  -1
       6:   [ 1 2 1 ]     11   2   1      [ 1 1 . ]      5   0  -1
       7:   [ . 2 1 ]     10   0  -1      [ . 1 . ]      4   0  -1
       8:   [ . 1 1 ]      8   1  -1      [ . 2 . ]      8   1   1
       9:   [ 1 1 1 ]      9   0   1      [ 1 2 . ]      9   0   1
      10:   [ 1 . 1 ]      7   1  -1      [ 2 2 . ]     10   0   1
      11:   [ . . 1 ]      6   0  -1      [ 3 2 . ]     11   0   1
      12:   [ . . 2 ]     12   2   1      [ 3 2 1 ]     23   2   1
      13:   [ 1 . 2 ]     13   0   1      [ 2 2 1 ]     22   0  -1
      14:   [ 1 1 2 ]     15   1   1      [ 1 2 1 ]     21   0  -1
      15:   [ . 1 2 ]     14   0  -1      [ . 2 1 ]     20   0  -1
      16:   [ . 2 2 ]     16   1   1      [ . 1 1 ]     16   1  -1
      17:   [ 1 2 2 ]     17   0   1      [ 1 1 1 ]     17   0   1
      18:   [ 1 2 3 ]     23   2   1      [ 2 1 1 ]     18   0   1
      19:   [ . 2 3 ]     22   0  -1      [ 3 1 1 ]     19   0   1
      20:   [ . 1 3 ]     20   1  -1      [ 3 . 1 ]     15   1  -1
      21:   [ 1 1 3 ]     21   0   1      [ 2 . 1 ]     14   0  -1
      22:   [ 1 . 3 ]     19   1  -1      [ 1 . 1 ]     13   0  -1
      23:   [ . . 3 ]     18   0  -1      [ . . 1 ]     12   0  -1

Figure 9.2-A: Mixed radix numbers in Gray code order, dots denote zeros. The radix vectors are M = [2,3,4] (left) and M = [4,3,2] (right). Columns 'x' give the values, columns 'j' and 'd' give the position of last change and its direction, respectively.


The array i_[] contains the 'directions' for each digit: it contains +1 or -1 if the computation of the successor will increase or decrease the corresponding digit. It has to be filled when the first or last number is computed:

    void first()
    {
        for (ulong k=0; k<n_; ++k)  a_[k] = 0;
        for (ulong k=0; k<n_; ++k)  i_[k] = +1;
        j_ = n_;
        dm_ = 0;
    }

    void last()
    {
        // digits below the last even radix are zero with direction -1
        // (same initialization as in the endo Gray version of last()):
        for (ulong k=0; k<n_; ++k)  a_[k] = 0;
        for (ulong k=0; k<n_; ++k)  i_[k] = -1UL;

        // find position of last even radix
        ulong z = 0;
        for (ulong i=0; i<n_; ++i)  if ( ml_[i]&1 )  z = i;
        while ( z<n_ )  // last even .. end
        {
            a_[z] = ml_[z];
            i_[z] = +1;
            ++z;
        }

        j_ = 0;
        dm_ = -1;
    }
    [--snip--]

A sentinel element (i_[n]=0) is used to optimize the computations of the successor and predecessor. The method works in constant amortized time:

    bool next()
    {
        ulong j = 0;
        ulong ij;
        while ( (ij=i_[j]) )  // can touch sentinel i[n]==0
        {
            ulong dj = a_[j] + ij;
            if ( dj>ml_[j] )  // =^= if ( (dj>ml_[j]) || ((long)dj<0) )
            {
                i_[j] = -ij;  // flip direction
            }
            else  // can update
            {
                a_[j] = dj;  // update digit
                dm_ = ij;    // save for dir()
                j_ = j;      // save for pos()
                return true;
            }

            ++j;
        }
        return false;
    }
    [--snip--]

Note the if-clause: it is an optimized expression equivalent to the one given as comment. Because the digits are unsigned, a decrement below zero wraps around to a value greater than any radix, so the single comparison catches both cases. The following methods are often useful:

    ulong pos() const  { return j_; }   // position of last change
    int dir() const  { return dm_; }    // direction of last change

The routine for the computation of the predecessor is obtained by changing the plus sign in the statement ulong dj = a_[j] + ij; to a minus sign. The rate of generation is about 128 M/s for radix 2, 243 M/s for radix 4, and 304 M/s for radix 8 [FXT: comb/mixedradix-gray-demo.cc].
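The minimal-change property of the constant amortized time successor can be verified directly. The following sketch is an illustration only: `GraySketch` and `check_gray()` are hypothetical names, `std::vector` replaces the raw arrays, and only the direction array carries a sentinel. The checker returns the number of words if every step changes exactly one digit by one, else zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone variant of the CAT Gray code successor.
struct GraySketch
{
    std::vector<unsigned long> a;      // digits
    std::vector<unsigned long> nines;  // radix minus one
    std::vector<unsigned long> dir;    // +1 or (unsigned) -1, dir[n]==0 is the sentinel
    std::size_t n;

    explicit GraySketch(const std::vector<unsigned long> &radices)
        : a(radices.size(), 0), nines(radices.size(), 0),
          dir(radices.size() + 1, 1), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
        dir[n] = 0;
    }

    bool next()
    {
        std::size_t j = 0;
        unsigned long ij;
        while ( (ij = dir[j]) != 0 )
        {
            const unsigned long dj = a[j] + ij;  // wraps around for a[j]==0, ij==-1
            if ( dj > nines[j] )  dir[j] = -ij;  // cannot move: flip direction, try next digit
            else  { a[j] = dj;  return true; }
            ++j;
        }
        return false;  // current word is the last
    }
};

// Walk through the whole list; return the number of words if every step
// changes exactly one digit by exactly one, else return 0.
inline unsigned long check_gray(const std::vector<unsigned long> &radices)
{
    GraySketch g(radices);
    unsigned long ct = 1;
    std::vector<unsigned long> old = g.a;
    while ( g.next() )
    {
        unsigned long changes = 0;
        for (std::size_t k = 0; k < g.n; ++k)
        {
            if ( g.a[k] == old[k] )  continue;
            const unsigned long d = ( g.a[k] > old[k] ? g.a[k] - old[k] : old[k] - g.a[k] );
            if ( d != 1 )  return 0;
            ++changes;
        }
        if ( changes != 1 )  return 0;
        old = g.a;
        ++ct;
    }
    return ct;
}
```

For the radix vectors [2,3,4] and [4,3,2] the checker reports all 24 words.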


9.2.2  Loopless  algorithm 


A loopless algorithm for the computation of the successor, taken from [215, alg.H, sect.7.2.1.1], is given in [FXT: comb/mixedradix-gray2.h]:

    class mixedradix_gray2
    {
    public:
        ulong *a_;   // digits
        ulong *ml_;  // radix minus one ('nines')
        ulong *f_;   // focus pointer
        ulong *d_;   // direction
        ulong n_;    // number of digits
        ulong j_;    // position of last change
        int dm_;     // direction of last move

    [--snip--]
        void first()
        {
            for (ulong k=0; k<n_; ++k)  a_[k] = 0;
            for (ulong k=0; k<n_; ++k)  d_[k] = 1;
            for (ulong k=0; k<=n_; ++k)  f_[k] = k;
            dm_ = 0;
            j_ = n_;
        }

        bool next()
        {
            const ulong j = f_[0];
            f_[0] = 0;

            if ( j>=n_ )  { first();  return false; }

            const ulong dj = d_[j];
            const ulong aj = a_[j] + dj;
            a_[j] = aj;

            dm_ = (int)dj;  // save for dir()
            j_ = j;         // save for pos()

            if ( aj+dj > ml_[j] )  // was last move?
            {
                d_[j] = -dj;      // change direction
                f_[j] = f_[j+1];  // lookup next position
                f_[j+1] = j + 1;
            }

            return true;
        }

The rate of generation is about 120 M/s for radix 2, 194 M/s for radix 4, and 264 M/s for radix 8 [FXT: comb/mixedradix-gray2-demo.cc].
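The focus pointer technique can be exercised with a small stand-alone sketch. The names `LooplessGraySketch` and `loopless_gray_words()` are hypothetical, `std::vector` replaces the raw arrays, and all radices are assumed to be at least 2. Unlike the listing above, the sketch does not reset to the first word when the list is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone variant of the loopless Gray code successor.
struct LooplessGraySketch
{
    std::vector<unsigned long> a;      // digits
    std::vector<unsigned long> nines;  // radix minus one
    std::vector<unsigned long> d;      // direction: 1 or (unsigned) -1
    std::vector<std::size_t> f;        // focus pointers, f[n] is a sentinel
    std::size_t n;

    explicit LooplessGraySketch(const std::vector<unsigned long> &radices)
        : a(radices.size(), 0), nines(radices.size(), 0),
          d(radices.size(), 1), f(radices.size() + 1, 0), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
        for (std::size_t k = 0; k <= n; ++k)  f[k] = k;
    }

    bool next()  // no loop: constant work per call
    {
        const std::size_t j = f[0];  // digit to change
        f[0] = 0;
        if ( j >= n )  return false;  // current word is the last

        const unsigned long dj = d[j];
        const unsigned long aj = a[j] + dj;
        a[j] = aj;

        if ( aj + dj > nines[j] )  // digit j cannot move further in this direction
        {
            d[j] = -dj;       // turn around
            f[j] = f[j + 1];  // skip digit j until a higher digit has moved
            f[j + 1] = j + 1;
        }
        return true;
    }
};

// Collect all words in the order generated.
inline std::vector< std::vector<unsigned long> >
loopless_gray_words(const std::vector<unsigned long> &radices)
{
    LooplessGraySketch g(radices);
    std::vector< std::vector<unsigned long> > w(1, g.a);
    while ( g.next() )  w.push_back(g.a);
    return w;
}
```

For M = [2,3,4] the sketch produces the 24 words in the same order as the left column of figure 9.2-A.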


9.2.3 Modular Gray code order

Figure 9.2-B: Mixed radix numbers in modular Gray code order, dots denote zeros. The radix vectors are M = [2,3,4] (left) and M = [4,3,2] (right). The columns 'j' give the position of last change.

Figure 9.2-B shows the modular Gray code order for 3-digit mixed radix numbers with radix vectors M = [2,3,4] (left) and M = [4,3,2] (right). The transitions are either k -> k+1 or, if k is maximal, k -> 0. The mixed radix modular Gray code can be generated as follows [FXT: class mixedradix_modular_gray2 in comb/mixedradix-modular-gray2.h]:

    class mixedradix_modular_gray2
    {
    public:
        ulong *a_;   // digits
        ulong *ml_;  // radix minus one ('nines')
        ulong *x_;   // count changes of digit
        ulong n_;    // number of digits
        ulong j_;    // position of last change

    public:
        mixedradix_modular_gray2(ulong n, ulong mm, const ulong *m=0)
        {
            n_ = n;
            a_ = new ulong[n_];
            ml_ = new ulong[n_+1];  // incl. sentinel at ml[n]
            x_ = new ulong[n_+1];   // incl. sentinel at x[n] (!= ml[n])

            mixedradix_init(n_, mm, m, ml_);

            first();
        }
    [--snip--]

The computation of the successor works in constant amortized time:

    bool next()
    {
        ulong j = 0;
        while ( x_[j] == ml_[j] )  // can touch sentinels
        {
            x_[j] = 0;
            ++j;
        }
        ++x_[j];

        if ( j==n_ )  { first();  return false; }  // current is last

        j_ = j;  // save position of change

        // increment:
        ulong aj = a_[j] + 1;
        if ( aj>ml_[j] )  aj = 0;
        a_[j] = aj;

        return true;
    }
    [--snip--]

The rate of generation is about 151 M/s for radix 2, 254 M/s for radix 4, and 267 M/s for radix 8 [FXT: comb/mixedradix-modular-gray2-demo.cc].

The loopless implementation [FXT: class mixedradix_modular_gray in comb/mixedradix-modular-gray.h] was taken from [215, ex.77, sect.7.2.1.1]. The rate of generation is about 169 M/s with radix 2, 197 M/s with radix 4, and 256 M/s with radix 8 [FXT: comb/mixedradix-modular-gray-demo.cc].
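The constant amortized time method amounts to running a counter in counting order and incrementing, modulo its radix, the digit at the position where the counter was incremented. A minimal sketch under that reading follows. The names `ModularGraySketch` and `modular_gray_words()` are hypothetical, and an explicit bound check replaces the sentinels of the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-alone variant of the modular Gray code successor.
struct ModularGraySketch
{
    std::vector<unsigned long> a;      // digits of the Gray code word
    std::vector<unsigned long> x;      // auxiliary counter in counting order
    std::vector<unsigned long> nines;  // radix minus one
    std::size_t n;

    explicit ModularGraySketch(const std::vector<unsigned long> &radices)
        : a(radices.size(), 0), x(radices.size(), 0),
          nines(radices.size(), 0), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
    }

    bool next()
    {
        std::size_t j = 0;
        while ( j < n && x[j] == nines[j] )  { x[j] = 0;  ++j; }  // carries of the counter
        if ( j == n )  return false;  // current word is the last
        ++x[j];
        a[j] = ( a[j] == nines[j] ? 0 : a[j] + 1 );  // increment modulo the radix
        return true;
    }
};

// Collect all words in the order generated.
inline std::vector< std::vector<unsigned long> >
modular_gray_words(const std::vector<unsigned long> &radices)
{
    ModularGraySketch g(radices);
    std::vector< std::vector<unsigned long> > w(1, g.a);
    while ( g.next() )  w.push_back(g.a);
    return w;
}
```

Each step changes a single digit by +1 modulo its radix by construction, and the test confirms that the 24 words for M = [2,3,4] are pairwise distinct.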


9.3 gslex order

Figure 9.3-A: Mixed radix numbers in gslex (generalized subset lex) order, dots denote zeros. The radix vectors are M = [2,3,4] (left) and M = [4,3,2] (right). Successive words differ in at most three positions. Columns 'x' give the values.

The algorithm for the generation of subsets in lexicographic order in set representation given in section 8.1.2 on page 203 can be generalized for mixed radix numbers. Figure 9.3-A shows the 3-digit mixed radix numbers for base M = [2,3,4] (left) and M = [4,3,2] (right). Note that zero is the last word in this order. For lack of a better name we call the order gslex (for generalized subset-lex) order. A generator for the gslex order is [FXT: class mixedradix_gslex in comb/mixedradix-gslex.h]:

    class mixedradix_gslex
    {
    public:
        ulong n_;    // n-digit numbers
        ulong *a_;   // digits
        ulong *ml_;  // ml[k] == radix-1 at position k

    public:
        mixedradix_gslex(ulong n, ulong mm, const ulong *m=0)
        {
            n_ = n;
            a_ = new ulong[n_ + 1];
            a_[n_] = 1;  // sentinel
            ml_ = new ulong[n_];
            mixedradix_init(n_, mm, m, ml_);
            first();
        }
    [--snip--]
        void first()
        {
            for (ulong k=0; k<n_; ++k)  a_[k] = 0;
            a_[0] = 1;
        }

        void last()
        {
            for (ulong k=0; k<n_; ++k)  a_[k] = 0;
        }


The method next() computes the successor:

    bool next()
    {
        ulong e = 0;
        while ( 0==a_[e] )  ++e;  // can touch sentinel

        if ( e==n_ )  { first();  return false; }  // current is last

        ulong ae = a_[e];
        if ( ae != ml_[e] )  // easy case: simple increment
        {
            a_[0] = 1;
            a_[e] = ae + 1;
        }
        else
        {
            a_[e] = 0;
            if ( a_[e+1]==0 )  // can touch sentinel
            {
                a_[0] = 1;
                ++a_[e+1];
            }
        }
        return true;
    }


The predecessor is computed by the method prev():

    bool prev()
    {
        ulong e = 0;
        while ( 0==a_[e] )  ++e;  // can touch sentinel

        if ( 0!=e )  // easy case: prepend nine
        {
            --e;
            a_[e] = ml_[e];
        }
        else
        {
            ulong a0 = a_[0];
            --a0;
            a_[0] = a0;
            if ( 0==a0 )
            {
                do  { ++e; }  while ( 0==a_[e] );  // can touch sentinel

                if ( e==n_ )  { last();  return false; }  // current is first

                ulong ae = a_[e];
                --ae;
                a_[e] = ae;
                if ( 0==ae )
                {
                    --e;
                    a_[e] = ml_[e];
                }
            }
        }
        return true;
    }

The routine works in constant amortized time and is fast in practice. The worst performance occurs when all digits are radix 2, then about 123 million objects are created per second. With radix 4 the rate is about 198 M/s, with radix 16 about 273 M/s [FXT: comb/mixedradix-gslex-demo.cc].

Alternative gslex order

Figure 9.3-B: Mixed radix numbers in alternative gslex order, dots denote zeros. The radix vectors are M = [2,3,4] (left) and M = [4,3,2] (right). Successive words differ in at most three positions. Columns 'x' give the values.

A variant of the gslex order is shown in figure 9.3-B. The ordering can be obtained from the gslex order by reversing the list, reversing the words, and replacing all nonzero digits d_i by r_i - d_i where r_i is the radix at position i. The implementation is given in [FXT: class mixedradix_gslex_alt in comb/mixedradix-gslex-alt.h], the rate of generation is about the same as with gslex order [FXT: comb/mixedradix-gslex-alt-demo.cc].
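The successor rule of the (first) gslex order can be checked against the properties stated for it: all words appear once, zero comes last, and successive words differ in at most three positions. The following sketch is a stand-alone illustration with hypothetical names (`GslexSketch`, `gslex_words()`, `max_positions_changed()`). It assumes at least one digit and radices of at least 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone variant of the gslex successor.
struct GslexSketch
{
    std::vector<unsigned long> a;      // digits a[0..n-1], sentinel a[n]==1
    std::vector<unsigned long> nines;  // radix minus one
    std::size_t n;

    explicit GslexSketch(const std::vector<unsigned long> &radices)
        : a(radices.size() + 1, 0), nines(radices.size(), 0), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
        a[n] = 1;  // sentinel
        a[0] = 1;  // first word
    }

    bool next()
    {
        std::size_t e = 0;
        while ( a[e] == 0 )  ++e;     // lowest nonzero digit (or the sentinel)
        if ( e == n )  return false;  // the all-zero word is the last

        const unsigned long ae = a[e];
        if ( ae != nines[e] )  // easy case: simple increment
        {
            a[0] = 1;
            a[e] = ae + 1;  // for e==0 this overwrites the line above
        }
        else
        {
            a[e] = 0;
            if ( a[e + 1] == 0 )  { a[0] = 1;  ++a[e + 1]; }  // may read the sentinel
        }
        return true;
    }
};

// Collect all words (without the sentinel) in the order generated.
inline std::vector< std::vector<unsigned long> >
gslex_words(const std::vector<unsigned long> &radices)
{
    GslexSketch g(radices);
    std::vector< std::vector<unsigned long> > w;
    do  { w.push_back(std::vector<unsigned long>(g.a.begin(), g.a.begin() + g.n)); }
    while ( g.next() );
    return w;
}

// Largest number of positions in which two successive words differ.
inline std::size_t max_positions_changed(const std::vector< std::vector<unsigned long> > &w)
{
    std::size_t mx = 0;
    for (std::size_t i = 1; i < w.size(); ++i)
    {
        std::size_t ct = 0;
        for (std::size_t k = 0; k < w[i].size(); ++k)  ct += ( w[i][k] != w[i - 1][k] );
        if ( ct > mx )  mx = ct;
    }
    return mx;
}
```

For M = [2,3,4] the list starts with [1 . .], [1 1 .], [. 1 .], [1 2 .], [. 2 .], [1 . 1] and ends with the all-zero word.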


9.4  endo  order 


The computation of the successor in mixed radix endo order (see section 6.6.1 on page 186) is very similar to the counting order described in section 9.1 on page 217. The implementation [FXT: class mixedradix_endo in comb/mixedradix-endo.h] uses an additional array le_[] of the last nonzero elements in endo order. Its entries are 2 where the nine m_i - 1 is greater than 1 (radix m_i > 2), else 1:

    class mixedradix_endo
    {
    public:
        ulong *a_;   // digits, sentinel a[n]
        ulong *ml_;  // radix (minus one) for each digit
        ulong *le_;  // last positive digit in endo order, sentinel le[n]
        ulong n_;    // Number of digits
        ulong j_;    // position of last change

        mixedradix_endo(const ulong *m, ulong n, ulong mm=0)
        {
            n_ = n;
            a_ = new ulong[n_+1];
            a_[n_] = 1;  // sentinel: != 0
            ml_ = new ulong[n_];
            mixedradix_init(n_, mm, m, ml_);

            le_ = new ulong[n_+1];
            le_[n_] = 0;  // sentinel: != a[n]
            for (ulong k=0; k<n_; ++k)  le_[k] = 2 - (ml_[k]==1);

            first();
        }
    [--snip--]

Figure 9.4-A: Mixed radix numbers in endo order, dots denote zeros. The radix vector is M = [5,6]. Columns 'x' give the values.


The first word is all zero, the last can be read from the array le_[]:

    void first()
    {
        for (ulong k=0; k<n_; ++k)  a_[k] = 0;
        j_ = n_;
    }

    void last()
    {
        for (ulong k=0; k<n_; ++k)  a_[k] = le_[k];
        j_ = n_;
    }
    [--snip--]

In the computation of the successor the function next_endo() is used instead of a simple increment:

    bool next()
    {
        bool ret = false;
        ulong j = 0;
        while ( a_[j]==le_[j] )  { a_[j]=0; ++j; }  // can touch sentinel
        if ( j<n_ )  // only if no overflow
        {
            a_[j] = next_endo(a_[j], ml_[j]);  // increment
            ret = true;
        }
        j_ = j;
        return ret;
    }

    bool prev()
    {
        bool ret = false;
        ulong j = 0;
        while ( a_[j]==0 )  { a_[j]=le_[j]; ++j; }  // can touch sentinel
        if ( j<n_ )  // only if no overflow
        {
            a_[j] = prev_endo(a_[j], ml_[j]);  // decrement
            ret = true;
        }
        j_ = j;
        return ret;
    }
    [--snip--]


The function next() generates between about 115 million (radix 2) and 180 million (radix 16) numbers per second. The listing in figure 9.4-A was created with the program [FXT: comb/mixedradix-endo-demo.cc].
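The helper next_endo() is not shown in this section. The sketch below supplies a stand-in, assuming that endo order for the digits 0, 1, ..., m is zero, then the odd values in increasing order, then the even values in decreasing order, so that the last nonzero value is 2 (or 1 for m = 1). All names (`next_endo_sketch()`, `EndoSketch`, `endo_words()`) are hypothetical and radices are assumed to be at least 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed successor in endo order for the digits 0..m:
// 0, 1, 3, 5, ..., then the even values downwards, ending with 2 (wraps to 0).
inline unsigned long next_endo_sketch(unsigned long x, unsigned long m)
{
    if ( x & 1 )  // odd: go up by two, switch to the largest even value at the end
    {
        x += 2;
        if ( x > m )  x = m - (m & 1);
    }
    else  // even: 0 -> 1, otherwise go down by two
    {
        x = ( x == 0 ? 1 : x - 2 );
    }
    return x;
}

// Hypothetical stand-alone variant of the endo order successor.
struct EndoSketch
{
    std::vector<unsigned long> a;      // digits, sentinel a[n]==1
    std::vector<unsigned long> nines;  // radix minus one
    std::vector<unsigned long> last;   // last digit in endo order, sentinel last[n]==0
    std::size_t n;

    explicit EndoSketch(const std::vector<unsigned long> &radices)
        : a(radices.size() + 1, 0), nines(radices.size(), 0),
          last(radices.size() + 1, 0), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
        for (std::size_t k = 0; k < n; ++k)  last[k] = ( nines[k] == 1 ? 1 : 2 );
        a[n] = 1;  // differs from last[n]==0, so the scan stops here
    }

    bool next()
    {
        std::size_t j = 0;
        while ( a[j] == last[j] )  { a[j] = 0;  ++j; }  // reset digits that are at their end
        if ( j == n )  return false;
        a[j] = next_endo_sketch(a[j], nines[j]);
        return true;
    }
};

// Collect all words (without the sentinel) in the order generated.
inline std::vector< std::vector<unsigned long> >
endo_words(const std::vector<unsigned long> &radices)
{
    EndoSketch g(radices);
    std::vector< std::vector<unsigned long> > w;
    do  { w.push_back(std::vector<unsigned long>(g.a.begin(), g.a.begin() + g.n)); }
    while ( g.next() );
    return w;
}
```

With M = [5,6] the lowest digit runs through 0, 1, 3, 4, 2 before the next digit changes, and the last of the 30 words is [2 2].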


9.5 Gray code for endo order

Figure 9.5-A: Mixed radix numbers in endo Gray code, dots denote zeros. The radix vector is M = [4,5]. Columns 'x' give the values, columns 'j' and 'd' give the position of last change and its direction, respectively.

A Gray code for mixed radix numbers in endo order is a modification of the CAT algorithm for the Gray code described in section 9.2 on page 220 [FXT: class mixedradix_endo_gray in comb/mixedradix-endo-gray.h]:

    class mixedradix_endo_gray
    {
    public:
        ulong *a_;   // mixed radix digits
        ulong *ml_;  // radices (minus one)
        ulong *i_;   // direction
        ulong *le_;  // last positive digit in endo order
        ulong n_;    // n_ digits
        ulong j_;    // position of last change
        int dm_;     // direction of last move

    [--snip--]
        void first()
        {
            for (ulong k=0; k<n_; ++k)  a_[k] = 0;
            for (ulong k=0; k<n_; ++k)  i_[k] = +1;
            j_ = n_;
            dm_ = 0;
        }


In the computation of the last number the digits from the last even radix to the end have to be set to the last digit in endo order:

    void last()
    {
        for (ulong k=0; k<n_; ++k)  a_[k] = 0;
        for (ulong k=0; k<n_; ++k)  i_[k] = -1UL;

        // find position of last even radix:
        ulong z = 0;
        for (ulong i=0; i<n_; ++i)  if ( ml_[i]&1 )  z = i;
        while ( z<n_ )  // last even .. end:
        {
            a_[z] = le_[z];
            i_[z] = +1;
            ++z;
        }

        j_ = 0;
        dm_ = -1;
    }
    [--snip--]

The successor is computed as follows:

    bool next()
    {
        ulong j = 0;
        ulong ij;
        while ( (ij=i_[j]) )  // can touch sentinel i[n]==0
        {
            ulong dj;
            bool ovq;  // overflow?
            if ( ij == 1 )
            {
                dj = next_endo(a_[j], ml_[j]);
                ovq = (dj==0);
            }
            else
            {
                ovq = (a_[j]==0);
                dj = prev_endo(a_[j], ml_[j]);
            }

            if ( ovq )  i_[j] = -ij;
            else
            {
                a_[j] = dj;
                dm_ = ij;
                j_ = j;
                return true;
            }

            ++j;
        }
        return false;
    }
    [--snip--]

For the routine for computation of the predecessor change the test if ( ij == 1 ) to if ( ij != 1 ). About 65 million (radix 2) and 110 million (radix 16) numbers per second are generated. The listing in figure 9.5-A was created with the program [FXT: comb/mixedradix-endo-gray-demo.cc].


9.6  Fixed  sum  of  digits 


Mixed radix numbers with sum of digits 4 in lexicographic order are shown in figure 9.6-A. The numbers in falling factorial base correspond to length-6 permutations with 4 inversions (left, see section 10.1.1; the sum of digits of the inversion table equals the number of inversions), the radix-4 numbers correspond to compositions of 4 into 4 parts of size at most 3 (middle, see section 7.1 on page 194), and the binary numbers correspond to combinations binomial(7,4) (right, see section 6.2 on page 177). The numbers also correspond to the k-subsets (combinations) of multisets, see section 13.1 on page 295.

The listings were created with the program [FXT: comb/mixedradix-sod-lex-demo.cc].

The successor is computed by determining the position j of the leftmost nonzero digit whose right neighbor can be incremented. After the increment the digits at positions up to j are set to the (lexicographically) first string such that the sum of digits is preserved. Sentinels are used with the scans [FXT: class mixedradix_sod_lex in comb/mixedradix-sod-lex.h]:

    class mixedradix_sod_lex
    {
    public:
        ulong *a_;   // digits
        ulong *ml_;  // nines (radix minus one) for each digit
        ulong n_;    // Number of digits
        ulong s_;    // Sum of digits
        ulong j_;    // rightmost position of last change
        ulong sm_;   // max possible sum of digits (arg s with first())

    public:
        mixedradix_sod_lex(ulong n, ulong mm, const ulong *m=0)
        {
            n_ = n;
            a_ = new ulong[n_+2];   // incl. two sentinels
            ml_ = new ulong[n_+2];  // incl. two sentinels
            a_[n_] = 1;     // sentinel !=0
            ml_[n_] = 2;    // sentinel >a[n]
            a_[n_+1] = 0;   // sentinel ==0
            ml_[n_+1] = 1;  // sentinel >0

            mixedradix_init(n_, mm, m, ml_);

            ulong s = 0;
            for (ulong i=0; i<n_; ++i)  s += ml_[i];
            sm_ = s;
        }
    [--snip--]

Figure 9.6-A: Mixed radix numbers with sum of digits 4 in lexicographic order: 5-digit falling factorial base (left), 4-digit radix 4 (middle), and 7-digit binary (right).

The sum of digits is supplied with the method first():

    bool first(ulong k)
    {
        s_ = k;
        if ( s_ > sm_ )  return false;  // too big

        ulong s = s_;
        for (ulong i=0; i<n_; ++i)  // every digit is written, no stale value survives
        {
            const ulong ml = ml_[i];
            if ( s >= ml )  { a_[i] = ml;  s -= ml; }
            else            { a_[i] = s;  s = 0; }
        }

        j_ = n_ - 1;
        return true;
    }

    bool next()
    {
        ulong j = 0;
        ulong s = 0;
        while ( (a_[j]==0) || (a_[j+1]==ml_[j+1]) )  // can read sentinels
        {
            s += a_[j];
            a_[j] = 0;
            ++j;
        }
        j_ = j+1;  // record rightmost position of change

        if ( j_ >= n_ )  return false;  // current is last

        s += (a_[j] - 1);
        a_[j] = 0;
        ++a_[j+1];  // increment next digit

        ulong i = 0;
        do  // set prefix to lex-first string
        {
            const ulong ml = ml_[i];
            if ( s >= ml )  { a_[i] = ml;  s -= ml; }
            else            { a_[i] = s;  s = 0; }
            ++i;
        }
        while ( s );
        return true;
    }
    [--snip--]
    };
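The successor for a fixed sum of digits can be checked by counting the words it produces. The sketch below is a stand-alone illustration with hypothetical names (`FixedSumSketch`, `count_fixed_sum()`); it uses the same two sentinel positions as the listing but `std::vector` storage, and it assumes radices of at least 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone variant of the fixed sum of digits generator.
struct FixedSumSketch
{
    std::vector<unsigned long> a;      // digits a[0..n-1], sentinels a[n], a[n+1]
    std::vector<unsigned long> nines;  // nines, sentinels nines[n], nines[n+1]
    std::size_t n;

    explicit FixedSumSketch(const std::vector<unsigned long> &radices)
        : a(radices.size() + 2, 0), nines(radices.size() + 2, 0), n(radices.size())
    {
        for (std::size_t k = 0; k < n; ++k)  nines[k] = radices[k] - 1;
        a[n] = 1;      nines[n] = 2;      // nonzero and not a nine
        a[n + 1] = 0;  nines[n + 1] = 1;  // zero and not a nine
    }

    bool first(unsigned long s)  // lex-first word with digit sum s
    {
        for (std::size_t k = 0; k < n; ++k)
        {
            const unsigned long d = ( s >= nines[k] ? nines[k] : s );
            a[k] = d;
            s -= d;
        }
        return ( s == 0 );  // false if s exceeds the sum of all nines
    }

    bool next()
    {
        std::size_t j = 0;
        unsigned long s = 0;
        // skip positions that cannot pass one unit to their right neighbor
        while ( a[j] == 0 || a[j + 1] == nines[j + 1] )  { s += a[j];  a[j] = 0;  ++j; }
        if ( j + 1 >= n )  return false;  // current word is the last

        s += a[j] - 1;
        a[j] = 0;
        ++a[j + 1];
        for (std::size_t i = 0; s != 0; ++i)  // lex-first prefix with digit sum s
        {
            const unsigned long d = ( s >= nines[i] ? nines[i] : s );
            a[i] = d;
            s -= d;
        }
        return true;
    }
};

// Number of words with radix vector 'radices' and sum of digits s.
inline unsigned long count_fixed_sum(const std::vector<unsigned long> &radices, unsigned long s)
{
    FixedSumSketch g(radices);
    if ( !g.first(s) )  return 0;
    unsigned long ct = 1;
    while ( g.next() )  ++ct;
    return ct;
}
```

For sum of digits 4 the counts are binomial(7,4) = 35 for seven binary digits, 31 for four radix-4 digits (the 35 compositions of 4 into 4 parts minus the 4 that contain a part of size 4), and 49 for the radix vector [6,5,4,3,2] (the number of permutations of 6 elements with 4 inversions).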
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Chapter  10 

Permutations 


We present algorithms for the generation of all permutations in various orders such as lexicographic and minimal-change order. Several methods to convert permutations to and from mixed radix numbers with factorial base are described. Algorithms for application, inversion, and composition of permutations and for the generation of random permutations are given in chapter 2.

10.1  Factorial  representations  of  permutations 

The factorial number system corresponds to the mixed radix bases M = [2,3,4,...] (rising factorial base) or M = [...,4,3,2] (falling factorial base). A factorial number with n-1 digits can have n! different values. We develop different methods to convert factorial numbers to permutations and vice versa.

10.1.1  The  Lehmer  code  (inversion  table) 

Each permutation of n elements can be converted to a unique (n-1)-digit factorial number A = [a_0, a_1, ..., a_{n-2}] in the falling factorial base: for each index k (except the last) count the number of elements with indices to the right of k that are less than the current element [FXT: comb/fact2perm.cc]:

    void perm2ffact(const ulong *x, ulong n, ulong *fc)
    // Convert permutation in x[0,...,n-1] into
    // the (n-1) digit falling factorial representation in fc[0,...,n-2].
    // We have: fc[0]<n, fc[1]<n-1, ..., fc[n-2]<2 (falling radices)
    {
        for (ulong k=0; k<n-1; ++k)
        {
            ulong xk = x[k];
            ulong i = 0;
            for (ulong j=k+1; j<n; ++j)  if ( x[j]<xk )  ++i;
            fc[k] = i;
        }
    }

The routine works because all elements of the permutation are distinct. The factorial representation computed is called the Lehmer code of the permutation. For example, the permutation P = [3,0,1,4,2] has the inversion table I = [3,0,0,1]: three elements less than the first element (3) lie to the right of it, no element less than the second (0) or third (1) element lies to the right of it, and one element less than 4 lies to the right of it.

An alternative term for the Lehmer code is inversion table: an inversion of a permutation

    [x_0, x_1, ..., x_{n-1}]    (10.1-1)

is a pair of indices k and j where k < j and x_j < x_k. Now fix k and call such an inversion (where an element x_j right of k is less than x_k) a right inversion at k. The inversion table [i_0, i_1, ..., i_{n-2}] of a permutation is computed by setting i_k to the number of right inversions at k. This is exactly what the given routine does.

A routine that computes the permutation for a given Lehmer code is

    void ffact2perm(const ulong *fc, ulong n, ulong *x)
    // Inverse of perm2ffact():
    // Convert the (n-1) digit falling factorial representation in fc[0,...,n-2]
    // into permutation in x[0,...,n-1]
    // Must have: fc[0]<n, fc[1]<n-1, ..., fc[n-2]<2 (falling radices)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = fc[k];
            if ( i )  rotate_right1(x+k, i+1);
        }
    }
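The two conversions above can be tried on the example permutation. The following sketch uses `std::vector` and hypothetical function names (`lehmer_code()`, `from_lehmer_code()`); `std::rotate` stands in for the FXT routine used in the listing, under the assumption that it rotates the given range right by one position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lehmer code: fc[k] = number of elements right of position k that are less than x[k].
inline std::vector<unsigned long> lehmer_code(const std::vector<unsigned long> &x)
{
    const std::size_t n = x.size();
    std::vector<unsigned long> fc( n ? n - 1 : 0, 0 );
    for (std::size_t k = 0; k + 1 < n; ++k)
        for (std::size_t j = k + 1; j < n; ++j)
            if ( x[j] < x[k] )  ++fc[k];
    return fc;
}

// Inverse conversion: start with the identity and, for each k, rotate the
// fc[k]+1 elements beginning at position k right by one position.
inline std::vector<unsigned long> from_lehmer_code(const std::vector<unsigned long> &fc)
{
    const std::size_t n = fc.size() + 1;
    std::vector<unsigned long> x(n);
    for (std::size_t k = 0; k < n; ++k)  x[k] = k;
    for (std::size_t k = 0; k + 1 < n; ++k)
    {
        const std::size_t i = fc[k];
        // the last element of the range [k, k+i] moves to the front
        std::rotate(x.begin() + k, x.begin() + k + i, x.begin() + k + i + 1);
    }
    return x;
}
```

For P = [3,0,1,4,2] the first function returns [3,0,0,1], and the second function maps that code back to P.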

A routine to compute the inverse permutation from the Lehmer code is

    void ffact2invperm(const ulong *fc, ulong n, ulong *x)
    // Convert the (n-1) digit falling factorial representation in fc[0,...,n-2]
    // into permutation in x[0,...,n-1] such that
    // the permutation is the inverse of the one computed via ffact2perm().
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=n-2; (long)k>=0; --k)
        {
            ulong i = fc[k];
            if ( i )  rotate_left1(x+k, i+1);
        }
    }

Figure 10.1-A: Numbers in falling factorial base and permutations so that the number is the Lehmer code of it (left columns). Dots denote zeros. The rising factorial representation of the reversed and complemented permutation equals the reversed Lehmer code (right columns).


A similar method can compute a representation in the rising factorial base. We count the number of elements to the left of k that are greater than the element at k (the number of left inversions at k):

    void perm2rfact(const ulong *x, ulong n, ulong *fc)
    // Convert permutation in x[0,...,n-1] into
    // the (n-1) digit rising factorial representation in fc[0,...,n-2].
    // We have: fc[0]<2, fc[1]<3, ..., fc[n-2]<n (rising radices)
    {
        for (ulong k=1; k<n; ++k)
        {
            ulong xk = x[k];
            ulong i = 0;
            for (ulong j=0; j<k; ++j)  if ( x[j]>xk )  ++i;
            fc[k-1] = i;
        }
    }

Figure 10.1-B: Numbers in rising factorial base and permutations so that the number is the Lehmer code of it (left columns). The reversed and complemented permutations and their falling factorial representations are shown in the right columns. They appear in lexicographic order.


The  inverse  routine  is 


    void rfact2perm(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        ulong *y = x+n;
        for (ulong k=n-1; k!=0; --k, --y)
        {
            ulong i = fc[k-1];
            if ( i )  { ++i;  rotate_left1(y-i, i); }
        }
    }


A routine  for  the  inverse  permutation  is 


    void rfact2invperm(const ulong *fc, ulong n, ulong *x)
    // Convert the (n-1) digit rising factorial representation in fc[0,...,n-2]
    // into permutation in x[0,...,n-1] such that
    // the permutation is the inverse of the one computed via rfact2perm().
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        ulong *y = x + 2;
        for (ulong k=0; k<n-1; ++k, ++y)
        {
            ulong i = fc[k];
            if ( i )  { ++i;  rotate_right1(y-i, i); }
        }
    }


The permutations corresponding to the Lehmer codes (in counting order) are shown in figure 10.1-A (left columns) which was created with the program [FXT: comb/fact2perm-demo.cc]. The permutation whose rising factorial representation is the digit-reversed Lehmer code is computed by reversing and complementing (replacing each element x by n-1-x) the original permutation:

    Lehmer code   permutation   rev. perm     compl. rev. perm   rising fact.
    [3,0,0,1]     [3,0,1,4,2]   [2,4,1,0,3]   [2,0,3,4,1]        [1,0,0,3]
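The relation shown in the table can be checked mechanically. The sketch below uses hypothetical helper names of our own (they are not FXT routines) and the quadratic counting methods given above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Falling factorial digits: right inversions at k, for k = 0..n-2.
std::vector<ulong> right_inversions(const std::vector<ulong> &x)
{
    std::vector<ulong> fc;
    for (std::size_t k=0; k+1<x.size(); ++k)
    {
        ulong i = 0;
        for (std::size_t j=k+1; j<x.size(); ++j)  if ( x[j]<x[k] )  ++i;
        fc.push_back(i);
    }
    return fc;
}

// Rising factorial digits: left inversions at k, for k = 1..n-1.
std::vector<ulong> left_inversions(const std::vector<ulong> &x)
{
    std::vector<ulong> fc;
    for (std::size_t k=1; k<x.size(); ++k)
    {
        ulong i = 0;
        for (std::size_t j=0; j<k; ++j)  if ( x[j]>x[k] )  ++i;
        fc.push_back(i);
    }
    return fc;
}

// Reverse the permutation, then replace each element e by n-1-e.
std::vector<ulong> reverse_complement(std::vector<ulong> x)
{
    std::reverse(x.begin(), x.end());
    for (ulong &e : x)  e = x.size() - 1 - e;
    return x;
}

// True if the rising factorial digits of the reversed and complemented
// permutation equal the digit-reversed Lehmer code of x.
bool relation_holds(const std::vector<ulong> &x)
{
    std::vector<ulong> fc = right_inversions(x);
    std::reverse(fc.begin(), fc.end());
    return  left_inversions( reverse_complement(x) ) == fc;
}

// Check the relation for all permutations of n elements.
bool relation_holds_for_all(std::size_t n)
{
    std::vector<ulong> x(n);
    for (std::size_t k=0; k<n; ++k)  x[k] = k;
    do  { if ( ! relation_holds(x) )  return false; }
    while ( std::next_permutation(x.begin(), x.end()) );
    return true;
}
```

The reason the relation holds is that a left inversion at position m of the reversed and complemented permutation is a right inversion at position n-1-m of the original.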
The permutations obtained from counting in the rising factorial base are shown in figure 10.1-B.

10.1.1.1  Computation  with  large  arrays 

With the left-right array described in section 4.7 on page 166 the conversion to and from the Lehmer code can be done in O(n log n) operations [FXT: comb/big-fact2perm.cc]:

    void perm2ffact(const ulong *x, ulong n, ulong *fc, left_right_array &LR)
    {
        LR.set_all();
        for (ulong k=0; k<n-1; ++k)
        {
            // i := number of Set positions Left of x[k], Excluding x[k].
            ulong i = LR.num_SLE( x[k] );
            LR.get_set_idx_chg( i );
            fc[k] = i;
        }
    }

The LR-array passed as an extra argument has to be of size n. Conversion of an array of, say, 10 million entries is a matter of seconds if this routine is used [FXT: comb/big-fact2perm-demo.cc].

    void ffact2perm(const ulong *fc, ulong n, ulong *x, left_right_array &LR)
    {
        LR.free_all();
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = LR.get_free_idx_chg( fc[k] );
            x[k] = i;
        }
        ulong i = LR.get_free_idx_chg( 0 );
        x[n-1] = i;
    }

The  routines  for  rising  factorials  are 

    void perm2rfact(const ulong *x, ulong n, ulong *fc, left_right_array &LR)
    {
        LR.set_all();
        for (ulong k=0, r=n-1; k<n-1; ++k, --r)  // r == n-1-k;
        {
            // i := number of Set positions Left of x[r], Excluding x[r].
            ulong i = LR.num_SLE( x[r] );
            LR.get_set_idx_chg( i );
            fc[r-1] = r - i;
        }
    }

and 

    void rfact2perm(const ulong *fc, ulong n, ulong *x, left_right_array &LR)
    {
        LR.free_all();
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = LR.get_free_idx_chg( fc[n-2-k] );
            x[n-1-k] = n-1-i;
        }
        ulong i = LR.get_free_idx_chg( 0 );
        x[0] = n-1-i;
    }

The conversion of the routines that compute permutations from factorial numbers into routines that compute the inverse permutations is especially easy. Just change the code as follows:

    x[a] = b;    ==>    x[b] = a;

We obtain the routines

    void ffact2invperm(const ulong *fc, ulong n, ulong *x, left_right_array &LR)
    {
        LR.free_all();
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = LR.get_free_idx_chg( fc[k] );
            x[i] = k;
        }
        ulong i = LR.get_free_idx_chg( 0 );
        x[i] = n-1;
    }

and 

    void rfact2invperm(const ulong *fc, ulong n, ulong *x, left_right_array &LR)
    {
        LR.free_all();
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = LR.get_free_idx_chg( fc[n-2-k] );
            x[n-1-i] = n-1-k;
        }
        ulong i = LR.get_free_idx_chg( 0 );
        x[n-1-i] = 0;
    }


10.1.1.2  The  number  of  inversions 

The number of inversions of a permutation can be computed as follows [FXT: perm/permq.cc]:

    ulong
    count_inversions(const ulong *f, ulong n)
    // Return number of inversions in f[],
    // i.e. number of pairs k,j where k<j and f[k]>f[j]
    {
        ulong ct = 0;
        for (ulong k=1; k<n; ++k)
        {
            ulong fk = f[k];
            for (ulong j=0; j<k; ++j)  ct += ( fk<f[j] );
        }
        return ct;
    }

The algorithm is O(n^2). For large arrays we can use the fact that the number of inversions equals the sum of digits of the Lehmer code; the resulting algorithm is O(n log n):

    ulong
    count_inversions(const ulong *f, ulong n, left_right_array *tLR)
    {
        left_right_array *LR = tLR;
        if ( tLR==0 )  LR = new left_right_array(n);

        ulong ct = 0;
        LR->set_all();
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = LR->num_SLE( f[k] );
            LR->get_set_idx_chg( i );
            ct += i;
        }

        if ( tLR==0 )  delete LR;
        return ct;
    }
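Readers without the left-right array at hand can obtain the same O(n log n) bound with a different, widely known structure. The following sketch swaps in a Fenwick tree (binary indexed tree) for the left-right array; it is not the FXT implementation, the function names are ours, and the input is assumed to be a permutation of 0, 1, ..., n-1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Quadratic reference: count pairs j<k with f[j]>f[k].
ulong count_inversions_slow(const std::vector<ulong> &f)
{
    ulong ct = 0;
    for (std::size_t k=1; k<f.size(); ++k)
        for (std::size_t j=0; j<k; ++j)  ct += ( f[k]<f[j] );
    return ct;
}

// O(n*log(n)) variant for a permutation f of 0..n-1.
// A Fenwick tree (stand-in for the left-right array) records which
// values have already been seen while scanning from right to left.
ulong count_inversions_fast(const std::vector<ulong> &f)
{
    const std::size_t n = f.size();
    std::vector<ulong> tree(n+1, 0);  // 1-based Fenwick tree
    ulong ct = 0;
    for (std::size_t k=n; k-- > 0; )
    {
        // number of seen values less than f[k] == right inversions at k
        for (std::size_t i=f[k]; i>0; i -= i & (~i + 1))  ct += tree[i];
        // mark value f[k] as seen
        for (std::size_t i=f[k]+1; i<=n; i += i & (~i + 1))  ++tree[i];
    }
    return ct;
}
```

The partial sums accumulated in the loop are exactly the digits of the Lehmer code, so the result for [3,0,1,4,2] is 3+0+0+1 = 4.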

10.1.2  A representation via reversals ‡

Replacing the rotations in the computation of a permutation from its Lehmer code by reversals gives a different one-to-one relation between factorial numbers and permutations. The routine for the falling factorial base is [FXT: comb/fact2perm-rev.cc]:

    void perm2ffact_rev(const ulong *x, ulong n, ulong *fc)
    {
        ALLOCA(ulong, ti, n);  // inverse permutation
        for (ulong k=0; k<n; ++k)  ti[x[k]] = k;
        for (ulong k=0; k<n-1; ++k)
        {
            ulong j;  // find element k
            for (j=k; j<n; ++j)  if ( ti[j]==k )  break;
            j -= k;
            fc[k] = j;
            reverse(ti+k, j+1);
        }
    }

Figure 10.1-C: Numbers in falling (top) and rising (bottom) factorial base and permutations so that the number is the alternative (reversal) code of it (left columns). The inverse permutations and their factorial representations are shown in the right columns. Dots denote zeros.


The  routine  is  the  inverse  of 


    void ffact2perm_rev(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = fc[k];
            // Lehmer:  rotate_right1(x+k, i+1);
            if ( i )  reverse(x+k, i+1);
        }
    }


Figure 10.1-C shows the permutations of 4 elements and their factorial representations. It was created with the program [FXT: comb/fact2perm-rev-demo.cc]. The routines for the rising factorial base are

    void perm2rfact_rev(const ulong *x, ulong n, ulong *fc)
    {
        ALLOCA(ulong, ti, n);  // inverse permutation
        for (ulong k=0; k<n; ++k)  ti[x[k]] = k;
        for (ulong k=n-1; k!=0; --k)
        {
            ulong j;  // find element k
            for (j=0; j<=k; ++j)  if ( ti[j]==k )  break;
            j = k - j;
            fc[k-1] = j;
            reverse(ti+k-j, j+1);
        }
    }

and 


    void rfact2perm_rev(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        ulong *y = x+n;
        for (ulong k=n-1; k!=0; --k, --y)
        {
            ulong i = fc[k-1];
            if ( i )
            {
                ++i;
                // Lehmer:  rotate_left1(y-i, i);
                reverse(y-i, i);
            }
        }
    }


10.1.3  A representation via rotations ‡

To compute permutations from the Lehmer code we used rotations by one position of length determined by the digits. If we fix the length and let the amount of rotation be the value of the digits, we obtain two more methods to compute permutations from factorial numbers [FXT: comb/fact2perm-rot.cc]:

    void ffact2perm_rot(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=0, len=n; k<n-1; ++k, --len)
        {
            ulong i = fc[k];
            rotate_left(x+k, len, i);
        }
    }

    void rfact2perm_rot(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=n-2, len=n; len>1; --k, --len)
        {
            ulong i = fc[k];
            rotate_left(x+n-len, len, i);
        }
    }

Figure 10.1-D: Falling (left) and rising (right) factorial numbers and permutations via rotation code.


Figure 10.1-D shows the permutations of 4 elements corresponding to the falling and rising factorial numbers in lexicographic order [FXT: comb/fact2perm-rot-demo.cc]. The second half of the inverse permutations is the reversed permutations in the first half in reversed order. The columns of the inverse permutations with the falling factorials are cyclic shifts of each other, see section 10.12 on page 271 for more orderings with this property.

The  routines  to  compute  the  factorial  representation  of  a given  permutation  are 

    void perm2ffact_rot(const ulong *x, ulong n, ulong *fc)
    {
        ALLOCA(ulong, t, n);
        for (ulong k=0; k<n; ++k)  t[x[k]] = k;  // inverse permutation
        for (ulong k=0; k<n-1; ++k)
        {
            ulong s = 0;  while ( t[k+s] != k )  ++s;
            if ( s!=0 )  rotate_left(t+k, n-k, s);
            fc[k] = s;
        }
    }


and 

    void perm2rfact_rot(const ulong *x, ulong n, ulong *fc)
    {
        ALLOCA(ulong, t, n);
        for (ulong k=0; k<n; ++k)  t[x[k]] = k;  // inverse permutation
        for (ulong k=0; k<n-1; ++k)
        {
            ulong s = 0;  while ( t[k+s] != k )  ++s;
            if ( s!=0 )  rotate_left(t+k, n-k, s);
            fc[n-2-k] = s;
        }
    }


10.1.4  A representation  via  swaps 


The following routines compute factorial representations via swaps; the method is adapted from [258]. The complexity of the direct implementation is O(n) [FXT: comb/fact2perm-swp.cc]:

    void perm2ffact_swp(const ulong *x, ulong n, ulong *fc)
    {
        ALLOCA(ulong, t, n);
        for (ulong k=0; k<n; ++k)  t[k] = x[k];
        ALLOCA(ulong, ti, n);  // inverse permutation
        for (ulong k=0; k<n; ++k)  ti[t[k]] = k;
        for (ulong k=0; k<n-1; ++k)
        {
            ulong tk = t[k];  // >= k
            fc[k] = tk - k;
            ulong j = ti[k];  // location of element k, j>=k
            ti[tk] = j;
            t[j] = tk;
        }
    }

Figure 10.1-E: Numbers in falling (top) and rising (bottom) factorial base and permutations so that the number is the alternative (swaps) code of it (left columns). The inverse permutations and their factorial representations are shown in the right columns. Dots denote zeros.


    void perm2rfact_swp(const ulong *x, ulong n, ulong *fc)
    {
        ALLOCA(ulong, t, n);
        for (ulong k=0; k<n; ++k)  t[k] = x[k];
        ALLOCA(ulong, ti, n);  // inverse permutation
        for (ulong k=0; k<n; ++k)  ti[t[k]] = k;
        for (ulong k=0; k<n-1; ++k)
        {
            ulong j = ti[k];  // location of element k, j>=k
            fc[n-2-k] = j - k;
            ulong tk = t[k];  // >=k
            ti[tk] = j;
            t[j] = tk;
        }
    }

Their inverses also have linear complexity, and no additional memory is needed. The routine for the falling base is

    void ffact2perm_swp(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=0; k<n-1; ++k)
        {
            ulong i = fc[k];
            swap2( x[k], x[k+i] );
        }
    }


The  routine  for  the  rising  base  is 


    void rfact2perm_swp(const ulong *fc, ulong n, ulong *x)
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        for (ulong k=0, j=n-2; k<n-1; ++k, --j)
        {
            ulong i = fc[k];
            swap2( x[j], x[j+i] );
        }
    }
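The falling-base pair of routines can be exercised with the following sketch. It follows the logic of the listings above but uses std::vector and std::swap in place of ALLOCA and swap2; the function names are ours.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned long ulong;

// Falling factorial digits (fc[k] < n-k)  -->  permutation, via swaps.
std::vector<ulong> ffact_to_perm_swp(const std::vector<ulong> &fc)
{
    const std::size_t n = fc.size() + 1;
    std::vector<ulong> x(n);
    for (std::size_t k=0; k<n; ++k)  x[k] = k;
    for (std::size_t k=0; k+1<n; ++k)  std::swap( x[k], x[k+fc[k]] );
    return x;
}

// Permutation  -->  falling factorial digits, in linear time.
// t is a working copy of x, ti its inverse (ti[t[k]]==k).
std::vector<ulong> perm_to_ffact_swp(const std::vector<ulong> &x)
{
    const std::size_t n = x.size();
    std::vector<ulong> t(x), ti(n), fc( n ? n-1 : 0 );
    for (std::size_t k=0; k<n; ++k)  ti[t[k]] = k;
    for (std::size_t k=0; k+1<n; ++k)
    {
        const ulong tk = t[k];   // >= k
        fc[k] = tk - k;
        const ulong j = ti[k];   // location of element k, j>=k
        ti[tk] = j;              // element tk moves to position j
        t[j] = tk;
    }
    return fc;
}

// Check that both routines are mutually inverse for all digit strings
// of a permutation of 4 elements.
bool round_trip_ok_4()
{
    for (ulong a=0; a<4; ++a)
        for (ulong b=0; b<3; ++b)
            for (ulong c=0; c<2; ++c)
            {
                const std::vector<ulong> fc = {a, b, c};
                if ( perm_to_ffact_swp( ffact_to_perm_swp(fc) ) != fc )  return false;
            }
    return true;
}
```

Step k of the decoding only needs the element at position k: it equals k plus the digit, because all later swaps leave the positions up to k untouched. This is why no search is required and the conversion is linear.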


The permutations corresponding to the alternative codes for the falling base are shown in figure 10.1-E (left columns, top). The inverse permutation has the rising factorial representation that is digit-reversed (right columns). The permutations corresponding to the alternative codes for the rising base are shown at the bottom of figure 10.1-E. The listings were created with the program [FXT: comb/fact2perm-swp-demo.cc]. The inverse permutations can be computed by applying the swaps (which are self-inverse) in reversed order; the routines are


    void ffact2invperm_swp(const ulong *fc, ulong n, ulong *x)
    // Generate inverse permutation wrt. ffact2perm_swp().
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        if ( n<=1 )  return;
        ulong k = n-2;
        do
        {
            ulong i = fc[k];
            swap2( x[k], x[k+i] );
        }
        while ( k-- );
    }

and 

    void rfact2invperm_swp(const ulong *fc, ulong n, ulong *x)
    // Generate inverse permutation wrt. rfact2perm_swp().
    {
        for (ulong k=0; k<n; ++k)  x[k] = k;
        if ( n<=1 )  return;
        ulong k = n-2,  j = 0;
        do
        {
            ulong i = fc[k];
            swap2( x[j], x[j+i] );
            ++j;
        }
        while ( k-- );
    }

The routines can serve as a means to find interesting orders for permutations. Indeed, the permutation generator shown in section 10.4 on page 245 was found this way. A recursive algorithm for the (inverse) permutations shown at the lower right of figure 10.1-E is given in section 11.4.1 on page 285.


10.2  Lexicographic order

Figure 10.2-A: All permutations of 4 elements in lexicographic order, their inverses, the complements of the inverses, and the reversed permutations. Dots denote zeros.

The permutations in lexicographic order appear as if (read as numbers and) sorted numerically in ascending order, see figure 10.2-A. The first half of the inverse permutations are the reversed inverse permutations in the second half: the position of zero in the first half of the inverse permutations lies in the first half of each permutation, so their reversal gives the second half. Write I for the operator that inverts a permutation, C for the complement, and R for reversal. Then we have

    C = I R I                                                    (10.2-1)

and thereby the first half of the permutations are the complements of the permutations in the second half. An implementation of an iterative algorithm is [FXT: class perm_lex in comb/perm-lex.h]:


    class perm_lex
    {
    public:
        ulong *p_;  // permutation in 0, 1, ..., n-1, sentinel at [-1]
        ulong n_;   // number of elements to permute

    public:
        perm_lex(ulong n)
        {
            n_ = n;
            p_ = new ulong[n_+1];
            p_[0] = 0;  // sentinel
            ++p_;
            first();
        }

        ~perm_lex()  { --p_;  delete [] p_; }

        void first()  { for (ulong i=0; i<n_; i++)  p_[i] = i; }

        const ulong *data()  const  { return p_; }
    [--snip--]

The method next() computes the next permutation with each call. The routine is based on code by Glenn Rhoads:

        bool next()
        {
            // find rightmost pair with p_[i] < p_[i+1]:
            const ulong n1 = n_ - 1;
            ulong i = n1;
            do  { --i; }  while ( p_[i] > p_[i+1] );
            if ( (long)i<0 )  return false;  // last sequence is falling seq.

            // find rightmost element p[j] greater than p[i]:
            ulong j = n1;
            while ( p_[i] > p_[j] )  { --j; }

            swap2(p_[i], p_[j]);

            // Here the elements p[i+1], ..., p[n-1] are a falling sequence.
            // Reverse order to the right:
            ulong r = n1;
            ulong s = i + 1;
            while ( r > s )  { swap2(p_[r], p_[s]);  --r;  ++s; }

            return true;
        }

Using the class is no black magic [FXT: comb/perm-lex-demo.cc]:

    ulong n = 4;
    perm_lex P(n);
    do
    {
        // visit permutation
    }
    while ( P.next() );

The routine generates about 130 million permutations per second. A faster algorithm is obtained by modifying the update operation for the co-lexicographic order (section 10.3) on the right end of the permutations [FXT: comb/perm-lex2.h]. The rate of generation is about 180 M/s when arrays are used and about 305 M/s with pointers [FXT: comb/perm-lex2-demo.cc].

The routine for computing the successor can easily be adapted for permutations of a multiset, see section 13.2.2.
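Relation 10.2-1 between inversion, complement, and reversal is easily verified by brute force. The following sketch does so for all permutations of n elements; the helper names are ours, and std::next_permutation (which also visits the permutations in lexicographic order) is used instead of the class perm_lex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> perm;

// I: inverse permutation, y[x[k]] = k
perm inverse(const perm &x)
{
    perm y(x.size());
    for (std::size_t k=0; k<x.size(); ++k)  y[x[k]] = k;
    return y;
}

// R: reversal, y[k] = x[n-1-k]
perm reversal(perm x)
{
    std::reverse(x.begin(), x.end());
    return x;
}

// C: complement, y[k] = n-1-x[k]
perm complement(perm x)
{
    for (ulong &e : x)  e = x.size() - 1 - e;
    return x;
}

// Check C = I R I for all permutations of n elements,
// visiting them in lexicographic order.
bool check_relation(std::size_t n)
{
    perm x(n);
    for (std::size_t k=0; k<n; ++k)  x[k] = k;
    do
    {
        if ( complement(x) != inverse( reversal( inverse(x) ) ) )  return false;
    }
    while ( std::next_permutation(x.begin(), x.end()) );
    return true;
}
```

The check mirrors the short argument behind the relation: reversing the inverse permutation moves the entry for element e from index e to index n-1-e, and inverting again turns this index change into a change of the values.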


10.3  Co-lexicographic  order 


Figure 10.3-A shows the permutations of 4 elements in co-lexicographic (colex) order.

          permutation      rfact        inv. perm.
     0:   [ 3 2 1 . ]    [ . . . ]    [ 3 2 1 . ]
     1:   [ 2 3 1 . ]    [ 1 . . ]    [ 3 2 . 1 ]
     2:   [ 3 1 2 . ]    [ . 1 . ]    [ 3 1 2 . ]
     3:   [ 1 3 2 . ]    [ 1 1 . ]    [ 3 . 2 1 ]
     4:   [ 2 1 3 . ]    [ . 2 . ]    [ 3 1 . 2 ]
     5:   [ 1 2 3 . ]    [ 1 2 . ]    [ 3 . 1 2 ]
     6:   [ 3 2 . 1 ]    [ . . 1 ]    [ 2 3 1 . ]
     7:   [ 2 3 . 1 ]    [ 1 . 1 ]    [ 2 3 . 1 ]
     8:   [ 3 . 2 1 ]    [ . 1 1 ]    [ 1 3 2 . ]
     9:   [ . 3 2 1 ]    [ 1 1 1 ]    [ . 3 2 1 ]
    10:   [ 2 . 3 1 ]    [ . 2 1 ]    [ 1 3 . 2 ]
    11:   [ . 2 3 1 ]    [ 1 2 1 ]    [ . 3 1 2 ]
    12:   [ 3 1 . 2 ]    [ . . 2 ]    [ 2 1 3 . ]
    13:   [ 1 3 . 2 ]    [ 1 . 2 ]    [ 2 . 3 1 ]
    14:   [ 3 . 1 2 ]    [ . 1 2 ]    [ 1 2 3 . ]
    15:   [ . 3 1 2 ]    [ 1 1 2 ]    [ . 2 3 1 ]
    16:   [ 1 . 3 2 ]    [ . 2 2 ]    [ 1 . 3 2 ]
    17:   [ . 1 3 2 ]    [ 1 2 2 ]    [ . 1 3 2 ]
    18:   [ 2 1 . 3 ]    [ . . 3 ]    [ 2 1 . 3 ]
    19:   [ 1 2 . 3 ]    [ 1 . 3 ]    [ 2 . 1 3 ]
    20:   [ 2 . 1 3 ]    [ . 1 3 ]    [ 1 2 . 3 ]
    21:   [ . 2 1 3 ]    [ 1 1 3 ]    [ . 2 1 3 ]
    22:   [ 1 . 2 3 ]    [ . 2 3 ]    [ 1 . 2 3 ]
    23:   [ . 1 2 3 ]    [ 1 2 3 ]    [ . 1 2 3 ]

Figure 10.3-A: The permutations of 4 elements in co-lexicographic order. Dots denote zeros.

An algorithm for the generation is implemented in [FXT: class perm_colex in comb/perm-colex.h]:


    class perm_colex
    {
    public:
        ulong *d_;  // mixed radix digits with radix = [2, 3, 4, ...]
        ulong *x_;  // permutation
        ulong n_;   // permutations of n elements

    public:
        perm_colex(ulong n)
        // Must have n>=2
        {
            n_ = n;
            d_ = new ulong[n_];
            d_[n-1] = 0;  // sentinel
            x_ = new ulong[n_];
            first();
        }
    [--snip--]
        void first()
        {
            for (ulong k=0; k<n_; ++k)  x_[k] = n_-1-k;
            for (ulong k=0; k<n_-1; ++k)  d_[k] = 0;
        }


The  update  process  uses  rising  factorial  numbers.  Let  j be  the  position  where  the  digit  is  incremented 
and  d the  value  before  the  increment.  The  update 


      permutation           rfact
                                  v-- increment at j=3
    [ 0 3 4 5 2 1 ]      [ 1 2 3 1 1 ]   <--= digit before increment is d=1
    [ 5 4 2 0 3 1 ]      [ . . . 2 1 ]

is done in three steps:

    [ 0 3 4 5 2 1 ]      [ 1 2 3 1 1 ]
    [ 0 2 4 5 3 1 ]      [ 1 2 3 2 1 ]   <--= swap positions d=1 and j+1=4
    [ 5 4 2 0 3 1 ]      [ . . . 2 1 ]   <--= reverse range 0...j


The  corresponding  method  is 


        bool next()
        {
            if ( d_[0]==0 )  // easy case
            {
                d_[0] = 1;
                swap2(x_[0], x_[1]);
                return true;
            }
            else
            {
                d_[0] = 0;
                ulong j = 1;
                ulong m1 = 2;  // nine in rising factorial base
                while ( d_[j]==m1 )
                {
                    d_[j] = 0;
                    ++m1;
                    ++j;
                }

                if ( j==n_-1 )  return false;  // current permutation is last

                const ulong dj = d_[j];
                d_[j] = dj + 1;

                swap2( x_[dj], x_[j+1] );  // swap positions dj and j+1

                {  // reverse range [0...j]:
                    ulong a = 0,  b = j;
                    do
                    {
                        swap2(x_[a], x_[b]);
                        ++a;
                        --b;
                    }
                    while ( a < b );
                }
                return true;
            }
        }
    };

About 220 million permutations per second can be generated [FXT: comb/perm-colex-demo.cc]. With arrays instead of pointers the rate is 330 million per second.
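The update by a swap followed by a reversal can be tried out in isolation. The following sketch collects all permutations in a vector; it follows the method next() above but is not the FXT class, and the name all_colex() is ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned long ulong;

// All permutations of n>=2 elements in co-lexicographic order,
// generated with the swap-and-reverse update.
std::vector< std::vector<ulong> > all_colex(std::size_t n)
{
    std::vector<ulong> x(n), d(n, 0);  // d[n-1]==0 serves as sentinel
    for (std::size_t k=0; k<n; ++k)  x[k] = n-1-k;  // first permutation

    std::vector< std::vector<ulong> > all;
    while ( true )
    {
        all.push_back(x);
        if ( d[0]==0 )  // easy case
        {
            d[0] = 1;
            std::swap(x[0], x[1]);
            continue;
        }
        d[0] = 0;
        std::size_t j = 1;
        ulong m1 = 2;  // maximal digit value at position j
        while ( d[j]==m1 )  { d[j] = 0;  ++m1;  ++j; }
        if ( j==n-1 )  break;  // current permutation is last

        const ulong dj = d[j];
        d[j] = dj + 1;
        std::swap( x[dj], x[j+1] );                // swap positions dj and j+1
        std::reverse( x.begin(), x.begin()+j+1 );  // reverse range [0...j]
    }
    return all;
}
```

For n=4 the result starts with [3,2,1,0], [2,3,1,0], [3,1,2,0] and ends with [0,1,2,3], in agreement with figure 10.3-A.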

10.4  An  order  from  reversing  prefixes 


A surprisingly simple algorithm for the generation of all permutations uses mixed radix counting with the radices [2, 3, 4, ...] (column digits in figure 10.4-A). Whenever the first j digits change with an increment, the permutation is updated by reversing the first j+1 elements (the method is given in [364]).

As with lex order, the first half of the permutations are the complements of the permutations in the second half. Now rewrite relation 10.2-1 on page 242 as

    R = I C I                                                    (10.4-1)

to see that the first half of the inverse permutations are the reversed inverse permutations in the second half. This can (for n even) also be observed from the positions of the largest element in the inverse permutations. A generator is [FXT: class perm_rev in comb/perm-rev.h]:

    class perm_rev
    {
    public:
        ulong *d_;  // mixed radix digits with radix = [2, 3, 4, ..., n-1, (sentinel=-1)]
        ulong *p_;  // permutation
        ulong n_;   // permutations of n elements

    public:
        perm_rev(ulong n)
        {
            n_ = n;
            p_ = new ulong[n_];
            d_ = new ulong[n_];
            d_[n-1] = -1UL;  // sentinel
            first();
        }

        ~perm_rev()
        {
            delete [] p_;
            delete [] d_;
        }

        void first()
        {
            for (ulong k=0; k<n_-1; ++k)  d_[k] = 0;
            for (ulong k=0; k<n_; ++k)  p_[k] = k;
        }

        void last()
        {
            for (ulong k=0; k<n_-1; ++k)  d_[k] = k+1;
            for (ulong k=0; k<n_; ++k)  p_[k] = n_-1-k;
        }

          permutation      rfact        inv. perm.
     0:   [ . 1 2 3 ]    [ . . . ]    [ . 1 2 3 ]
     1:   [ 1 . 2 3 ]    [ 1 . . ]    [ 1 . 2 3 ]
     2:   [ 2 . 1 3 ]    [ . 1 . ]    [ 1 2 . 3 ]
     3:   [ . 2 1 3 ]    [ 1 1 . ]    [ . 2 1 3 ]
     4:   [ 1 2 . 3 ]    [ . 2 . ]    [ 2 . 1 3 ]
     5:   [ 2 1 . 3 ]    [ 1 2 . ]    [ 2 1 . 3 ]
     6:   [ 3 . 1 2 ]    [ . . 1 ]    [ 1 2 3 . ]
     7:   [ . 3 1 2 ]    [ 1 . 1 ]    [ . 2 3 1 ]
     8:   [ 1 3 . 2 ]    [ . 1 1 ]    [ 2 . 3 1 ]
     9:   [ 3 1 . 2 ]    [ 1 1 1 ]    [ 2 1 3 . ]
    10:   [ . 1 3 2 ]    [ . 2 1 ]    [ . 1 3 2 ]
    11:   [ 1 . 3 2 ]    [ 1 2 1 ]    [ 1 . 3 2 ]
    12:   [ 2 3 . 1 ]    [ . . 2 ]    [ 2 3 . 1 ]
    13:   [ 3 2 . 1 ]    [ 1 . 2 ]    [ 2 3 1 . ]
    14:   [ . 2 3 1 ]    [ . 1 2 ]    [ . 3 1 2 ]
    15:   [ 2 . 3 1 ]    [ 1 1 2 ]    [ 1 3 . 2 ]
    16:   [ 3 . 2 1 ]    [ . 2 2 ]    [ 1 3 2 . ]
    17:   [ . 3 2 1 ]    [ 1 2 2 ]    [ . 3 2 1 ]
    18:   [ 1 2 3 . ]    [ . . 3 ]    [ 3 . 1 2 ]
    19:   [ 2 1 3 . ]    [ 1 . 3 ]    [ 3 1 . 2 ]
    20:   [ 3 1 2 . ]    [ . 1 3 ]    [ 3 1 2 . ]
    21:   [ 1 3 2 . ]    [ 1 1 3 ]    [ 3 . 2 1 ]
    22:   [ 2 3 1 . ]    [ . 2 3 ]    [ 3 2 . 1 ]
    23:   [ 3 2 1 . ]    [ 1 2 3 ]    [ 3 2 1 . ]

Figure 10.4-A: All permutations of 4 elements in an order where the first j+1 elements are reversed when the first j digits change in the mixed radix counting sequence with radices [2, 3, 4, ...].


The  update  routines  are  quite  concise: 


        bool next()
        {
            // increment mixed radix number:
            ulong j = 0;
            while ( d_[j]==j+1 )  { d_[j]=0;  ++j; }

            // j==n-1 for last permutation
            if ( j!=n_-1 )  // only if no overflow
            {
                ++d_[j];
                reverse(p_, j+2);  // update permutation
                return true;
            }
            else  return false;
        }

        bool prev()
        {
            // decrement mixed radix number:
            ulong j = 0;
            while ( d_[j]==0 )  { d_[j]=j+1;  ++j; }

            // j==n-1 for last permutation
            if ( j!=n_-1 )  // only if no overflow
            {
                --d_[j];
                reverse(p_, j+2);  // update permutation
                return true;
            }
            else  return false;
        }
    };


Note that the routines work for arbitrary (distinct) entries of the array p_[].

An upper bound for the average number of elements that are moved in the transitions when generating all N = n! permutations is e ≈ 2.7182818, so the algorithm is CAT (constant amortized time). The implementation generates more than 140 million permutations per second [FXT: comb/perm-rev-demo.cc]. Usage of the class is simple:

    ulong n = 4;  // Number of elements to permute
    perm_rev P(n);
    P.first();
    do
    {
        // Use permutation here
    }
    while ( P.next() );

We note that the inverse permutations have the single-track property, see section 10.12 on page 271.
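The prefix-reversal update and the bound on the work per update can be observed with a small stand-alone sketch. It mirrors the method next() of the class but is not the FXT code; the name all_prefix_reversal() and the optional output parameter moved are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// All permutations of n>=1 elements in the order obtained by reversing
// the first j+2 elements whenever digit j is incremented.
// If moved is nonzero, *moved receives the total length of all reversed prefixes.
std::vector< std::vector<ulong> > all_prefix_reversal(std::size_t n, ulong *moved=nullptr)
{
    std::vector<ulong> p(n), d(n, 0);
    d[n-1] = ~0UL;  // sentinel: never equals a maximal digit
    for (std::size_t k=0; k<n; ++k)  p[k] = k;

    std::vector< std::vector<ulong> > all;
    ulong total = 0;
    while ( true )
    {
        all.push_back(p);
        std::size_t j = 0;
        while ( d[j]==j+1 )  { d[j] = 0;  ++j; }   // increment mixed radix number
        if ( j==n-1 )  break;                      // overflow: last permutation
        ++d[j];
        std::reverse( p.begin(), p.begin()+j+2 );  // update permutation
        total += j+2;
    }
    if ( moved )  *moved = total;
    return all;
}
```

For n=4 the reversed prefixes have a total length of 12·2 + 8·3 + 3·4 = 60, that is, 2.5 elements per permutation, which is below the bound e.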

10.4.1  Method  for  unranking 

Conversion of a rising factorial number into the corresponding permutation proceeds as exemplified for the 16-th permutation (15 = 1·1 + 1·2 + 2·6, so d=[1,1,2]):

     1:  p=[ 0, 1, 2, 3 ]   d=[ 0, 0, 0 ]   // start
    13:  p=[ 2, 3, 0, 1 ]   d=[ 0, 0, 2 ]   // right rotate all elements twice
    15:  p=[ 0, 2, 3, 1 ]   d=[ 0, 1, 2 ]   // right rotate first three elements
    16:  p=[ 2, 0, 3, 1 ]   d=[ 1, 1, 2 ]   // right rotate first two elements

The  idea  can  be  implemented  as 

        void goto_rfact(const ulong *d)
        // Goto permutation corresponding to d[] (i.e. unrank d[]).
        // d[] must be a valid (rising) factorial mixed radix string:
        // d[]==[d(0), d(1), d(2), ..., d(n-2)] (n-1 elements) where 0<=d(j)<=j+1
        {
            for (ulong k=0; k<n_; ++k)  p_[k] = k;
            for (ulong k=0; k<n_-1; ++k)  d_[k] = d[k];
            for (long j=n_-2; j>=0; --j)  rotate_right(p_, j+2, d_[j]);
        }

Compare to the method of section 10.1.3 on page 238.
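The unranking by rotations can be reproduced with the standard library. In the sketch below, which is not the FXT method, std::rotate is assumed as a replacement for rotate_right() (a right rotation by s of len elements is written as a left rotation by len-s), and the function name is ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Unrank: rising factorial digits d[0..n-2] with 0<=d[j]<=j+1
// -->  permutation in the prefix-reversal order.
std::vector<ulong> unrank_prefix_reversal(const std::vector<ulong> &d)
{
    const std::size_t n = d.size() + 1;
    std::vector<ulong> p(n);
    for (std::size_t k=0; k<n; ++k)  p[k] = k;
    for (std::size_t j=n-1; j-- > 0; )  // j = n-2, n-3, ..., 0
    {
        const std::size_t len = j + 2;
        const std::size_t s = d[j];  // amount of right rotation, s < len
        // right rotation by s == left rotation by len-s
        std::rotate( p.begin(), p.begin()+(len-s), p.begin()+len );
    }
    return p;
}
```

With the digits d=[1,1,2] of the example the routine returns [2,0,3,1], the permutation with index 15 in figure 10.4-A.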

10.4.2  Optimizing  the  update  routine 

We  optimize  the  update  routine  by  observing  that  5 out  of  6 updates  are  the  swaps 

(0,1)  (0,2)  (0,1)  (0,2)  (0,1) 

We use a counter ct_ and modify the methods first() and next() accordingly [FXT: class perm_rev2 in comb/perm-rev2.h]:

    class perm_rev2
    {
        perm_rev2(ulong n)
        {
            n_ = n;
            const ulong s = ( n_<3 ? 3 : n_ );
            p_ = new ulong[s+1];
            d_ = new ulong[s];
            first();
        }
    [--snip--]
        ulong next()
        // Return index of last element with reversal.
        // Return n with last permutation.
        {
            if ( ct_!=0 )  // easy case(s)
            {
                --ct_;
                const ulong e = 1 + (ct_ & 1);
                swap2(p_[0], p_[e]);
                return e;
            }
            else
            {
                ct_ = 5;  // reset counter
                ulong j = 2;  // note: start with 2
                while ( d_[j]==j+1 )  { d_[j]=0;  ++j; }  // can touch sentinel
                ++d_[j];
                reverse(p_, j+2);  // update permutation
                return j + 1;
            }
        }
    [--snip--]


The speedup is remarkable: about 275 million permutations per second are generated (about 8.5 cycles per update) [FXT: comb/perm-rev2-demo.cc]. If arrays are used instead of pointers, the rate drops to about 200 M/s.


10.5  Minimal-change  order  (Heap’s  algorithm) 


          permutation      swap       digits      rfact(perm)    inv. perm.
     0:   [ . 1 2 3 ]     (0, 0)    [ . . . ]     [ . . . ]     [ . 1 2 3 ]
     1:   [ 1 . 2 3 ]     (1, 0)    [ 1 . . ]     [ 1 . . ]     [ 1 . 2 3 ]
     2:   [ 2 . 1 3 ]     (2, 0)    [ . 1 . ]     [ 1 1 . ]     [ 1 2 . 3 ]
     3:   [ . 2 1 3 ]     (1, 0)    [ 1 1 . ]     [ . 1 . ]     [ . 2 1 3 ]
     4:   [ 1 2 . 3 ]     (2, 0)    [ . 2 . ]     [ . 2 . ]     [ 2 . 1 3 ]
     5:   [ 2 1 . 3 ]     (1, 0)    [ 1 2 . ]     [ 1 2 . ]     [ 2 1 . 3 ]
     6:   [ 3 1 . 2 ]     (3, 0)    [ . . 1 ]     [ 1 2 1 ]     [ 2 1 3 . ]
     7:   [ 1 3 . 2 ]     (1, 0)    [ 1 . 1 ]     [ . 2 1 ]     [ 2 . 3 1 ]
     8:   [ . 3 1 2 ]     (2, 0)    [ . 1 1 ]     [ . 1 1 ]     [ . 2 3 1 ]
     9:   [ 3 . 1 2 ]     (1, 0)    [ 1 1 1 ]     [ 1 1 1 ]     [ 1 2 3 . ]
    10:   [ 1 . 3 2 ]     (2, 0)    [ . 2 1 ]     [ 1 . 1 ]     [ 1 . 3 2 ]
    11:   [ . 1 3 2 ]     (1, 0)    [ 1 2 1 ]     [ . . 1 ]     [ . 1 3 2 ]
    12:   [ . 2 3 1 ]     (3, 1)    [ . . 2 ]     [ . . 2 ]     [ . 3 1 2 ]
    13:   [ 2 . 3 1 ]     (1, 0)    [ 1 . 2 ]     [ 1 . 2 ]     [ 1 3 . 2 ]
    14:   [ 3 . 2 1 ]     (2, 0)    [ . 1 2 ]     [ 1 1 2 ]     [ 1 3 2 . ]
    15:   [ . 3 2 1 ]     (1, 0)    [ 1 1 2 ]     [ . 1 2 ]     [ . 3 2 1 ]
    16:   [ 2 3 . 1 ]     (2, 0)    [ . 2 2 ]     [ . 2 2 ]     [ 2 3 . 1 ]
    17:   [ 3 2 . 1 ]     (1, 0)    [ 1 2 2 ]     [ 1 2 2 ]     [ 2 3 1 . ]
    18:   [ 3 2 1 . ]     (3, 2)    [ . . 3 ]     [ 1 2 3 ]     [ 3 2 1 . ]
    19:   [ 2 3 1 . ]     (1, 0)    [ 1 . 3 ]     [ . 2 3 ]     [ 3 2 . 1 ]
    20:   [ 1 3 2 . ]     (2, 0)    [ . 1 3 ]     [ . 1 3 ]     [ 3 . 2 1 ]
    21:   [ 3 1 2 . ]     (1, 0)    [ 1 1 3 ]     [ 1 1 3 ]     [ 3 1 2 . ]
    22:   [ 2 1 3 . ]     (2, 0)    [ . 2 3 ]     [ 1 . 3 ]     [ 3 1 . 2 ]
    23:   [ 1 2 3 . ]     (1, 0)    [ 1 2 3 ]     [ . . 3 ]     [ 3 . 1 2 ]

Figure 10.5-A: The permutations of 4 elements in a minimal-change order. Dots denote zeros.


Figure 10.5-A shows the permutations of 4 elements in a minimal-change order: just 2 elements are
swapped with each update. The column labeled digits shows the mixed radix numbers with rising
factorial base in counting order. Let j be the position of the rightmost change of the mixed radix string
R. Then the swap is (j+1, x) where x = 0 if j is odd, and x = R_j - 1 if j is even. The sequence of
values j+1 starts

1,  2,  1,  2,  1,  3,  1,  2,  1,  2,  1,  3,  1,  2,  1,  2,  1,  3,  1,  2,  1,  2,  1,  4,  1,  2,  1,  ... 


The n-th value (starting with n = 1) is the largest z such that z! divides n (entry A055881 in [312]).

The list of rising factorial representations of the permutations is a Gray code only for permutations of up
to four elements (column labeled rfact(perm) in figure 10.5-A).
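The claim about the sequence of values j+1 can be checked with a few lines of code. The following is a minimal sketch, not part of FXT: the function names carry_positions() and largest_factorial_divisor() and the parameter digits are made up for this illustration. The first routine counts in the rising factorial base (digit j has radix j+2) and records j+1 for each increment, the second computes the largest z such that z! divides n.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Values j+1 for the first `count` increments of a mixed radix number
// with rising factorial base (digit j takes the values 0,...,j+1).
// `digits` is an illustrative bound on the length of the number.
std::vector<ulong> carry_positions(ulong count, ulong digits = 12)
{
    std::vector<ulong> d(digits, 0), out;
    for (ulong n = 1; n <= count; ++n)
    {
        ulong j = 0;
        while ( j < digits && d[j] == j + 1 )  { d[j] = 0;  ++j; }  // carry
        if ( j == digits )  break;  // overflow of the whole number
        ++d[j];
        out.push_back(j + 1);
    }
    return out;
}

// Largest z such that z! divides n (for n >= 1).
ulong largest_factorial_divisor(ulong n)
{
    ulong z = 1, f = 1;  // invariant: f == z! and f divides n
    while ( n % (f * (z + 1)) == 0 )  { ++z;  f *= z; }
    return z;
}
```

Comparing carry_positions(30)[n-1] with largest_factorial_divisor(n) for n = 1, ..., 30 confirms the statement for the values listed above.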

An implementation of the algorithm (given in [178]) is [FXT: class perm_heap in comb/perm-heap.h]:

class perm_heap
{
public:
    ulong *d_;  // mixed radix digits with radix = [2, 3, 4, ..., n-1, (sentinel=-1)]
    ulong *p_;  // permutation
    ulong n_;   // permutations of n elements
    ulong sw1_, sw2_;  // indices of swapped elements
[--snip--]


The  computation  of  the  successor  is  simple: 


bool next()
{
    // increment mixed radix number:
    ulong j = 0;
    while ( d_[j]==j+1 )  { d_[j]=0;  ++j; }  // can touch sentinel

    // j==n-1 for last permutation:
    if ( j==n_-1 )  return false;

    ulong k = j+1;
    ulong x = ( k&1 ? d_[j] : 0 );
    swap2(p_[k], p_[x]);  // omit statement to just compute swaps
    sw1_ = k;  sw2_ = x;

    ++d_[j];
    return true;
}
[--snip--]


About  133  million  permutations  are  generated  per  second.  Often  one  will  only  use  the  indices  of  the 
swapped  elements  to  update  the  visited  configurations: 


void get_swap(ulong &s1, ulong &s2)  const  { s1=sw1_;  s2=sw2_; }


Then the statement swap2(p_[k], p_[x]); in the update routine can be omitted which leads to a rate
of 215 M/s. Figure 10.5-A shows the permutations of 4 elements. It was created with the program [FXT:
comb/perm-heap-demo.cc].
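For experiments without the FXT library the update step can be restated on std::vector. The following is a minimal sketch under that assumption: the names heap_order(), num_diff() and perm_t are made up here, and an explicit bound check replaces the sentinel digit of the class above.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> perm_t;

// All permutations of {0,...,n-1} in the order of Heap's method.
std::vector<perm_t> heap_order(ulong n)
{
    perm_t d(n, 0);  // mixed radix digits, digit j has radix j+2
    perm_t p(n);
    for (ulong i = 0; i < n; ++i)  p[i] = i;

    std::vector<perm_t> all(1, p);
    while ( true )
    {
        // increment mixed radix number (bound check instead of sentinel):
        ulong j = 0;
        while ( j + 1 < n && d[j] == j + 1 )  { d[j] = 0;  ++j; }
        if ( j + 1 >= n )  break;  // overflow: last permutation reached

        const ulong k = j + 1;
        const ulong x = ( k & 1 ? d[j] : 0 );
        std::swap(p[k], p[x]);
        ++d[j];
        all.push_back(p);
    }
    return all;
}

// Number of positions where a and b differ.
ulong num_diff(const perm_t &a, const perm_t &b)
{
    ulong ct = 0;
    for (ulong i = 0; i < a.size(); ++i)  ct += ( a[i] != b[i] );
    return ct;
}
```

With n = 4 the list returned by heap_order() has 24 entries, successive entries differ in exactly two positions, and the entries agree with the column labeled permutation in figure 10.5-A.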


10.5.1  Optimized  implementation 


The algorithm can be optimized by treating 5 out of 6 cases separately, those where the first or second
digit in the mixed radix number changes [FXT: class perm_heap2 in comb/perm-heap2.h]:

class perm_heap2
{
public:
    ulong *d_;  // mixed radix digits with radix = [2, 3, 4, 5, ..., n-1, (sentinel=-1)]
    ulong *p_;  // permutation
    ulong n_;   // permutations of n elements
    ulong sw1_, sw2_;  // indices of swapped elements
    ulong ct_;  // count 5,4,3,2,1,(0); nonzero ==> easy cases
[--snip--]

The counter is set to 5 in the method first(). The update routine is

ulong next()
// Return index of last element with reversal.
// Return n with last permutation.
{
    if ( ct_!=0 )  // easy cases
    {
        --ct_;
        sw1_ = 1 + (ct_ & 1);  // == 1,2,1,2,1
        sw2_ = 0;
        swap2(p_[sw1_], p_[sw2_]);
        return sw1_;
    }
    else
    {
        ct_ = 5;  // reset counter

        // increment mixed radix number:
        ulong j = 2;
        while ( d_[j]==j+1 )  { d_[j]=0;  ++j; }  // can touch sentinel

        // j==n-1 for last permutation:
        if ( j==n_-1 )  return n_;

        ulong k = j+1;
        ulong x = ( k&1 ? d_[j] : 0 );
        swap2(p_[k], p_[x]);
        sw1_ = k;  sw2_ = x;

        ++d_[j];
        return k;
    }
}

Usage of the class is shown in [FXT: comb/perm-heap2-demo.cc]:

do { /* visit permutation */ } while ( P.next()!=n );

The rate of generation is about 280 M/s (7.85 cycles per update), and 460 M/s (4.78 cycles per update)
with fixed arrays.

If only the swaps are of interest, we can simply omit all statements involving the permutation array p_[].
The implementation is [FXT: class perm_heap2_swaps in comb/perm-heap2-swaps.h], usage of the class
is shown in [FXT: comb/perm-heap2-swaps-demo.cc].

Heap's algorithm and the optimization idea were taken from the excellent survey [305] which gives several
permutation algorithms and implementations in pseudocode.


10.6  Lipski’s  Minimal-change  orders 


Several  algorithms  similar  to  Heap’s  method  are  given  in  Lipski’s  paper  [235] . 


10.6.1  Variants  of  Heap’s  algorithm 


Four orderings for the permutations of five elements are shown in figure 10.6-A. The leftmost order
is Heap's order. The implementation is given in [FXT: class perm_gray_lipski in comb/perm-gray-lipski.h],
the variable r determines the order that is generated:


class perm_gray_lipski
{
[--snip--]
    ulong r_;  // order (0<=r<4):
[--snip--]

    bool next()
    {
        // increment mixed radix number:
        ulong j = 0;
        while ( d_[j]==j+1 )  { d_[j]=0;  ++j; }

        if ( j<n_-1 )  // only if no overflow
        {
            const ulong d = d_[j];

            ulong x;
            switch ( r_ )
            {
            case 0:  x = (j&1 ? 0 : d);      break;  // Lipski(9) == Heap
            case 1:  x = (j&1 ? 0 : j-d);    break;  // Lipski(16)
            case 2:  x = (j&1 ? j-1 : d);    break;  // Lipski(10)
            default: x = (j&1 ? j-1 : j-d);  break;  // not in Lipski's paper
            }

            const ulong k = j+1;
            swap2(p_[k], p_[x]);
            sw1_ = k;  sw2_ = x;

            d_[j] = d + 1;
            return true;
        }
        else  return false;  // j==n-1 for last permutation
    }
[--snip--]
};

Figure 10.6-A: [...] Next to the permutations the swaps are shown as (x,y), a swap (x,0) is given as (x).

The top lines in figure 10.6-A repeat the statements in the switch-block. For three or fewer elements all
orderings coincide. With n = 4 elements the orderings for r = 0 and r = 2 coincide, as do the orderings
for r = 1 and r = 3. About 110 million permutations per second are generated [FXT:
comb/perm-gray-lipski-demo.cc]. Optimizations similar to those for Heap's method should be obvious.


10.6.2  Variants  of  Wells’  algorithm


      x=( (j&1) || (d<=1) ? j : j-d );          x=( (j&1) || (d==0) ? 0 : d-1 );
  1:   [ . 1 2 3 ]                          1:   [ . 1 2 3 ]
  2:   [ 1 . 2 3 ]   (1, 0)                 2:   [ 1 . 2 3 ]   (1, 0)
  3:   [ 1 2 . 3 ]   (2, 1)                 3:   [ 2 . 1 3 ]   (2, 0)
  4:   [ 2 1 . 3 ]   (1, 0)                 4:   [ . 2 1 3 ]   (1, 0)
  5:   [ 2 . 1 3 ]   (2, 1)                 5:   [ 1 2 . 3 ]   (2, 0)
  6:   [ . 2 1 3 ]   (1, 0)                 6:   [ 2 1 . 3 ]   (1, 0)
  7:   [ . 2 3 1 ]   (3, 2)                 7:   [ 3 1 . 2 ]   (3, 0)
  8:   [ 2 . 3 1 ]   (1, 0)                 8:   [ 1 3 . 2 ]   (1, 0)
  9:   [ 2 3 . 1 ]   (2, 1)                 9:   [ . 3 1 2 ]   (2, 0)
 10:   [ 3 2 . 1 ]   (1, 0)                10:   [ 3 . 1 2 ]   (1, 0)
 11:   [ 3 . 2 1 ]   (2, 1)                11:   [ 1 . 3 2 ]   (2, 0)
 12:   [ . 3 2 1 ]   (1, 0)                12:   [ . 1 3 2 ]   (1, 0)
 13:   [ . 3 1 2 ]   (3, 2)                13:   [ 2 1 3 . ]   (3, 0)
 14:   [ 3 . 1 2 ]   (1, 0)                14:   [ 1 2 3 . ]   (1, 0)
 15:   [ 3 1 . 2 ]   (2, 1)                15:   [ 3 2 1 . ]   (2, 0)
 16:   [ 1 3 . 2 ]   (1, 0)                16:   [ 2 3 1 . ]   (1, 0)
 17:   [ 1 . 3 2 ]   (2, 1)                17:   [ 1 3 2 . ]   (2, 0)
 18:   [ . 1 3 2 ]   (1, 0)                18:   [ 3 1 2 . ]   (1, 0)
 19:   [ 2 1 3 . ]   (3, 0)                19:   [ 3 . 2 1 ]   (3, 1)
 20:   [ 1 2 3 . ]   (1, 0)                20:   [ . 3 2 1 ]   (1, 0)
 21:   [ 1 3 2 . ]   (2, 1)                21:   [ 2 3 . 1 ]   (2, 0)
 22:   [ 3 1 2 . ]   (1, 0)                22:   [ 3 2 . 1 ]   (1, 0)
 23:   [ 3 2 1 . ]   (2, 1)                23:   [ . 2 3 1 ]   (2, 0)
 24:   [ 2 3 1 . ]   (1, 0)                24:   [ 2 . 3 1 ]   (1, 0)

Figure  10.6-B:  Wells’  order  for  the  permutations  of  four  elements  (left)  and  an  order  where  most  swaps 
are  with  the  first  position  (right).  Dots  denote  the  element  zero. 


A Gray code for permutations given by Wells [350] is shown in the left of figure 10.6-B. The following
implementation includes two variants of the algorithm. We just give the crucial assignments in the
computation of the successor [FXT: class perm_gray_wells in comb/perm-gray-wells.h]:


bool next()
{
[--snip--]
    switch ( r_ )
    {
    case 1:  x = ( (j&1) || (d==0) ? 0 : d-1 );  break;  // Lipski(14)
    case 2:  x = ( (j&1) || (d==0) ? j : d-1 );  break;  // Lipski(15)
    default: x = ( (j&1) || (d<=1) ? j : j-d );  break;  // Wells' order == Lipski(8)
    }
[--snip--]
}


Both expressions (d==0) can be changed to (d<=1) without changing the algorithm. About 105 million
permutations per second are generated [FXT: comb/perm-gray-wells-demo.cc].
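The three assignments can be tried out with a small stand-alone routine. This is only a sketch: it assumes that the snipped parts of next() have the same structure as the Lipski listing above (increment of the mixed radix number, k = j+1, swap of p[k] and p[x]), and the names wells_sketch() and perm_t are invented for the illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> perm_t;

// All permutations of {0,...,n-1} for the variant r:
// r==1: Lipski(14),  r==2: Lipski(15),  otherwise: Wells' order.
// Assumption: everything besides the assignment to x is as in the Lipski listing.
std::vector<perm_t> wells_sketch(ulong n, int r)
{
    perm_t d(n, 0), p(n);
    for (ulong i = 0; i < n; ++i)  p[i] = i;

    std::vector<perm_t> all(1, p);
    while ( true )
    {
        // increment mixed radix number (bound check instead of sentinel):
        ulong j = 0;
        while ( j + 1 < n && d[j] == j + 1 )  { d[j] = 0;  ++j; }
        if ( j + 1 >= n )  break;  // last permutation reached

        const ulong dj = d[j];
        ulong x;
        switch ( r )
        {
        case 1:  x = ( (j&1) || (dj==0) ? 0 : dj-1 );  break;
        case 2:  x = ( (j&1) || (dj==0) ? j : dj-1 );  break;
        default: x = ( (j&1) || (dj<=1) ? j : j-dj );  break;
        }

        std::swap(p[j+1], p[x]);
        d[j] = dj + 1;
        all.push_back(p);
    }
    return all;
}
```

For n = 4 each of the three variants visits all 24 permutations, and the lists for r = 0 and r = 1 agree with the left and right column of figure 10.6-B.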
       permutation    swap     inverse p.     direction
  0:   [ . 1 2 3 ]   (3, 2)   [ . 1 2 3 ]   + + + +
  1:   [ 1 . 2 3 ]   (0, 1)   [ 1 . 2 3 ]   + + + +
  2:   [ 1 2 . 3 ]   (1, 2)   [ 2 . 1 3 ]   + + + +
  3:   [ 1 2 3 . ]   (2, 3)   [ 3 . 1 2 ]   + + + +
  4:   [ 2 1 3 . ]   (0, 1)   [ 3 1 . 2 ]   - + + +
  5:   [ 2 1 . 3 ]   (3, 2)   [ 2 1 . 3 ]   - + + +
  6:   [ 2 . 1 3 ]   (2, 1)   [ 1 2 . 3 ]   - + + +
  7:   [ . 2 1 3 ]   (1, 0)   [ . 2 1 3 ]   - + + +
  8:   [ . 2 3 1 ]   (2, 3)   [ . 3 1 2 ]   + + + +
  9:   [ 2 . 3 1 ]   (0, 1)   [ 1 3 . 2 ]   + + + +
 10:   [ 2 3 . 1 ]   (1, 2)   [ 2 3 . 1 ]   + + + +
 11:   [ 2 3 1 . ]   (2, 3)   [ 3 2 . 1 ]   + + + +
 12:   [ 3 2 1 . ]   (0, 1)   [ 3 2 1 . ]   - - + +
 13:   [ 3 2 . 1 ]   (3, 2)   [ 2 3 1 . ]   - - + +
 14:   [ 3 . 2 1 ]   (2, 1)   [ 1 3 2 . ]   - - + +
 15:   [ . 3 2 1 ]   (1, 0)   [ . 3 2 1 ]   - - + +
 16:   [ . 3 1 2 ]   (3, 2)   [ . 2 3 1 ]   + - + +
 17:   [ 3 . 1 2 ]   (0, 1)   [ 1 2 3 . ]   + - + +
 18:   [ 3 1 . 2 ]   (1, 2)   [ 2 1 3 . ]   + - + +
 19:   [ 3 1 2 . ]   (2, 3)   [ 3 1 2 . ]   + - + +
 20:   [ 1 3 2 . ]   (1, 0)   [ 3 . 2 1 ]   - - + +
 21:   [ 1 3 . 2 ]   (3, 2)   [ 2 . 3 1 ]   - - + +
 22:   [ 1 . 3 2 ]   (2, 1)   [ 1 . 3 2 ]   - - + +
 23:   [ . 1 3 2 ]   (1, 0)   [ . 1 3 2 ]   - - + +

Figure 10.7-A: The permutations of 4 elements in a strong minimal-change order (smallest element
moves most often). Dots denote zeros.

 P=[3]        -->  [2, 3]  [3, 2]

 P=[2, 3]     -->  [1, 2, 3]  [2, 1, 3]  [2, 3, 1]
 P=[3, 2]     -->  [3, 2, 1]  [3, 1, 2]  [1, 3, 2]

 P=[1, 2, 3]  -->  [0, 1, 2, 3]  [1, 0, 2, 3]  [1, 2, 0, 3]  [1, 2, 3, 0]
 P=[2, 1, 3]  -->  [2, 1, 3, 0]  [2, 1, 0, 3]  [2, 0, 1, 3]  [0, 2, 1, 3]
 P=[2, 3, 1]  -->  [0, 2, 3, 1]  [2, 0, 3, 1]  [2, 3, 0, 1]  [2, 3, 1, 0]
 P=[3, 2, 1]  -->  [3, 2, 1, 0]  [3, 2, 0, 1]  [3, 0, 2, 1]  [0, 3, 2, 1]
 P=[3, 1, 2]  -->  [0, 3, 1, 2]  [3, 0, 1, 2]  [3, 1, 0, 2]  [3, 1, 2, 0]
 P=[1, 3, 2]  -->  [1, 3, 2, 0]  [1, 3, 0, 2]  [1, 0, 3, 2]  [0, 1, 3, 2]

 perm(4) == the permutations in the last six rows, read row by row.

Figure 10.7-B: Trotter's construction as an interleaving process.




10.7  Strong  minimal-change  order  (Trotter’s  algorithm) 


Figure 10.7-A shows the permutations of 4 elements in a strong minimal-change order: just two elements
are swapped with each update and these are adjacent. In the sequence of the inverse permutations the
swapped pair always consists of elements x and x + 1. Also the first and last permutation differ by
an adjacent transposition (of the last two elements). The ordering can be obtained by an interleaving
process shown in figure 10.7-B. The first half of the permutations in this order are the reversals of the
second half: the relative order of the two smallest elements is changed only with the transition just after
the first half and reversal changes the order of these two elements. Mutually reversed permutations lie
n!/2 positions apart.
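The interleaving process can be written down directly: starting from the empty permutation, the elements n-1, n-2, ..., 0 are inserted one after another, and for each permutation of the previous level the new (smallest) element sweeps through all positions, alternately left to right and right to left. The routine below is a sketch of this construction only (it is not how the FXT class works, and the names trotter_interleave() and perm_t are chosen for this illustration).

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> perm_t;

// Permutations of {0,...,n-1} in Trotter's order via interleaving.
std::vector<perm_t> trotter_interleave(ulong n)
{
    std::vector<perm_t> cur(1);  // level 0: just the empty permutation
    for (ulong m = 1; m <= n; ++m)
    {
        const ulong e = n - m;  // element inserted at this level
        std::vector<perm_t> nxt;
        bool up = true;  // sweep direction, flipped after each sweep
        for (const perm_t &q : cur)
        {
            for (ulong t = 0; t < m; ++t)
            {
                const ulong pos = ( up ? t : m - 1 - t );
                perm_t w(q);
                w.insert(w.begin() + pos, e);
                nxt.push_back(w);
            }
            up = !up;
        }
        cur.swap(nxt);
    }
    return cur;
}
```

For n = 4 the result has 24 entries that agree with the list perm(4) in figure 10.7-B, and the entry at index i+12 is the reversal of the entry at index i.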


A computer program to generate all permutations in the shown order was given in 1962 by H. F. Trotter [334],
see also [...] and [137]. We compute both the permutation and its inverse [FXT: class perm_trotter
in comb/perm-trotter.h]:

class perm_trotter
{
public:
    ulong n_;    // number of elements to permute
    ulong *x_;   // permutation of {0, 1, ..., n-1}
    ulong *xi_;  // inverse permutation
    ulong *d_;   // auxiliary: directions
    ulong sw1_, sw2_;  // indices of elements swapped most recently

public:
    perm_trotter(ulong n)
    {
        n_ = n;
        x_ = new ulong[n_+2];
        xi_ = new ulong[n_];
        d_ = new ulong[n_];
        ulong sen = 0;  // sentinel value minimal
        x_[0] = x_[n_+1] = sen;
        ++x_;
        first();
    }
[--snip--]

Sentinel elements are put at the lower and the higher end of the array for the permutation. For each
element we store a direction-flag = ±1 in an array d_[]. Initially all are set to +1:

    void fl_swaps()
    // Auxiliary routine for first() and last().
    // Set sw1, sw2 to swaps between first and last permutation.
    {
        sw1_ = ( n_==0 ? 0 : n_ - 1 );
        sw2_ = ( n_<2 ? 0 : n_ - 2 );
    }

    void first()
    {
        for (ulong i=0; i<n_; i++)  xi_[i] = i;
        for (ulong i=0; i<n_; i++)  x_[i] = i;
        for (ulong i=0; i<n_; i++)  d_[i] = 1;
        fl_swaps();
    }
[--snip--]

To compute the successor, find the smallest element e1 whose neighbor e2 (left or right neighbor, according
to the direction) is greater than e1. Swap the elements e1 and e2, and change the direction of all
elements that could not be moved. The locations of the elements, i1 and i2, are found with the inverse
permutation, which has to be updated accordingly:

    bool next()
    {
        for (ulong e1=0; e1<n_; ++e1)
        {
            // e1 is the element we try to move
            ulong i1 = xi_[e1];  // position of element e1
            ulong d = d_[e1];    // direction to move e1
            ulong i2 = i1 + d;   // position to swap with
            ulong e2 = x_[i2];   // element to swap with

            if ( e1 < e2 )  // can we swap?
            {
                xi_[e1] = i2;
                xi_[e2] = i1;
                x_[i1] = e2;
                x_[i2] = e1;
                sw1_ = i1;  sw2_ = i2;
                while ( e1-- )  d_[e1] = -d_[e1];
                return true;
            }
        }

        first();
        return false;
    }

The  locations  of  the  swap  are  retrieved  by  the  method 


    void get_swap(ulong &s1, ulong &s2)  const
    { s1=sw1_;  s2=sw2_; }


The  last  permutation  is  computed  as  follows: 


    void last()
    {
        for (ulong i=0; i<n_; i++)  xi_[i] = i;
        for (ulong i=0; i<n_; i++)  x_[i] = i;
        for (ulong i=0; i<n_; i++)  d_[i] = -1UL;
        fl_swaps();
        d_[sw1_] = +1;  d_[sw2_] = +1;
        swap2(x_[sw1_], x_[sw2_]);
        swap2(xi_[sw1_], xi_[sw2_]);
    }


The routine for the predecessor is almost identical to the method next():

    bool prev()
    {
[--snip--]
            ulong d = -d_[e1];  // direction to move e1 (NOTE: negated)
[--snip--]
        last();
        return false;
    }


The routines next() and prev() generate about 145 million permutations per second. Figure 10.7-A
was created with the program [FXT: comb/perm-trotter-demo.cc]:

    ulong n = 4;
    perm_trotter P(n);
    do
    {
        // visit permutation
    }
    while ( P.next() );
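The properties stated at the beginning of this section can be checked with a stand-alone restatement of the update routine. The sketch below is an assumption-laden variant, not the FXT class: it uses std::vector, signed direction flags and signed positions instead of the unsigned arithmetic above, and the names trotter_sketch, perm() and perm_t are made up for the illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> perm_t;

struct trotter_sketch
{
    ulong n;
    std::vector<ulong> xs;  // permutation, with a sentinel (zero) at either end
    std::vector<ulong> xi;  // inverse permutation
    std::vector<int> d;     // directions, +1 or -1
    ulong sw1, sw2;         // positions swapped most recently

    explicit trotter_sketch(ulong nn)
        : n(nn), xs(nn + 2, 0), xi(nn), d(nn), sw1(0), sw2(0)
    { first(); }

    void first()
    {
        for (ulong i = 0; i < n; ++i)  { xs[i+1] = i;  xi[i] = i;  d[i] = +1; }
        sw1 = ( n == 0 ? 0 : n - 1 );
        sw2 = ( n < 2 ? 0 : n - 2 );
    }

    bool next()
    {
        for (ulong e1 = 0; e1 < n; ++e1)  // e1 is the element we try to move
        {
            const long i1 = (long)xi[e1];  // position of e1
            const long i2 = i1 + d[e1];    // position to swap with (-1 or n: sentinel)
            const ulong e2 = xs[i2 + 1];   // element to swap with
            if ( e1 < e2 )  // can we swap?
            {
                xi[e1] = (ulong)i2;  xi[e2] = (ulong)i1;
                xs[i1 + 1] = e2;  xs[i2 + 1] = e1;
                sw1 = (ulong)i1;  sw2 = (ulong)i2;
                for (ulong e = 0; e < e1; ++e)  d[e] = -d[e];  // smaller elements turn around
                return true;
            }
        }
        first();
        return false;
    }

    perm_t perm() const  { return perm_t(xs.begin() + 1, xs.begin() + 1 + n); }
};
```

A loop of the form do { ... } while ( T.next() ); with trotter_sketch T(4) visits the 24 permutations of figure 10.7-A, and after every successful update the positions sw1 and sw2 are adjacent.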


10.7.1  Optimized  update  routines 

The element zero is moved most o[...]

[...] [FXT: comb/no1x1-gray-demo.cc]:

ulong n;    // number of bits in words
ulong *rv;  // bits of the word

void no1x1_rec(ulong d, bool z)
{
    if ( d>=n )  { if ( d<=n+2 )  visit(); }
    else
    {
        if ( z )
        {
            rv[d]=1;  rv[d+1]=0;  rv[d+2]=0;  no1x1_rec(d+3, z);
            rv[d]=1;  rv[d+1]=1;  rv[d+2]=0;  rv[d+3]=0;  no1x1_rec(d+4, !z);
            rv[d]=0;  no1x1_rec(d+1, z);
        }
        else
        {
            rv[d]=0;  no1x1_rec(d+1, z);
            rv[d]=1;  rv[d+1]=1;  rv[d+2]=0;  rv[d+3]=0;  no1x1_rec(d+4, !z);
            rv[d]=1;  rv[d+1]=0;  rv[d+2]=0;  no1x1_rec(d+3, z);
        }
    }
}


The sequence of the numbers v(n) of length-n strings starts as

n:     0  1  2  3  4   5   6   7   8    9   10   11   12   13    14    15    16    17
v(n):  1  2  4  6  9  15  25  40  64  104  169  273  441  714  1156  1870  3025  4895

This is entry A006498 in [312]. The recurrence relation is

    v(n) = v(n-1) + v(n-3) + v(n-4)                                                  (14.10-2)

The generating function is

    \sum_{n=0}^{\infty} v(n) x^n = \frac{1 + x + 2x^2 + x^3}{1 - x - x^3 - x^4}      (14.10-3)
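The recurrence can be compared with a direct count over all 2^n words. The following sketch is for checking only; the function names are made up and the four initial values 1, 2, 4, 6 are taken from the table above. A word w has a substring 1x1 exactly if w & (w>>2) is nonzero.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Count the length-n binary words with no substring 1x1 by testing all words.
ulong count_no1x1_brute(ulong n)
{
    ulong ct = 0;
    for (ulong w = 0; w < (1UL << n); ++w)
        if ( (w & (w >> 2)) == 0 )  ++ct;  // no two ones at distance 2
    return ct;
}

// v(0), ..., v(nmax) via v(n) = v(n-1) + v(n-3) + v(n-4),
// seeded with v(0..3) = 1, 2, 4, 6.
std::vector<ulong> v_by_recurrence(ulong nmax)
{
    std::vector<ulong> v = {1, 2, 4, 6};
    for (ulong n = 4; n <= nmax; ++n)  v.push_back( v[n-1] + v[n-3] + v[n-4] );
    v.resize(nmax + 1);
    return v;
}
```

Both routines give the same values for the range shown in the table, for example 169 for n = 10.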


14.10.2 No substrings 1xy1

Figure 14.10-B: The length-10 binary strings with no substring 1xy1 (where x and y are either 0 or 1)
in minimal-change order. Dots denote zeros.


Figure 14.10-B shows a Gray code for binary words with no substring 1xy1. The recursion for the list of
n-bit words Y(n) is

            [ 1000 . Y(n-4)     ]
            [ 101000 . Y^R(n-6) ]
    Y(n) =  [ 111000 . Y(n-6)   ]                                                    (14.10-4)
            [ 11000 . Y^R(n-5)  ]
            [ 0 . Y(n-1)        ]


An implementation is given in [FXT: comb/no1xy1-gray-demo.cc]:

void Y_rec(long p1, long p2, bool z)
{
    if ( p1>p2 )  { visit();  return; }

#define S1(a) rv[p1+0]=a
#define S2(a,b) S1(a); rv[p1+1]=b;
#define S3(a,b,c) S2(a,b); rv[p1+2]=c;
#define S4(a,b,c,d) S3(a,b,c); rv[p1+3]=d;
#define S5(a,b,c,d,e) S4(a,b,c,d); rv[p1+4]=e;
#define S6(a,b,c,d,e,f) S5(a,b,c,d,e); rv[p1+5]=f;

    long d = p2 - p1;
    if ( z )
    {
        if ( d >= 0 )  { S4(1,0,0,0);      Y_rec(p1+4, p2, z); }   // 1000
        if ( d >= 2 )  { S6(1,0,1,0,0,0);  Y_rec(p1+6, p2, !z); }  // 101000
        if ( d >= 2 )  { S6(1,1,1,0,0,0);  Y_rec(p1+6, p2, z); }   // 111000
        if ( d >= 1 )  { S5(1,1,0,0,0);    Y_rec(p1+5, p2, !z); }  // 11000
        if ( d >= 0 )  { S1(0);            Y_rec(p1+1, p2, z); }   // 0
    }
    else
    {
        if ( d >= 0 )  { S1(0);            Y_rec(p1+1, p2, z); }   // 0
        if ( d >= 1 )  { S5(1,1,0,0,0);    Y_rec(p1+5, p2, !z); }  // 11000
        if ( d >= 2 )  { S6(1,1,1,0,0,0);  Y_rec(p1+6, p2, z); }   // 111000
        if ( d >= 2 )  { S6(1,0,1,0,0,0);  Y_rec(p1+6, p2, !z); }  // 101000
        if ( d >= 0 )  { S4(1,0,0,0);      Y_rec(p1+4, p2, z); }   // 1000
    }
}


Note the conditions if ( d >= ? ) that make sure that no string appears repeated. The initial call is
Y_rec(0, n-1, 0). The sequence of the numbers y(n) of length-n strings starts as

n:     0  1  2  3   4   5   6   7   8    9   10   11   12   13    14    15    16    17
y(n):  1  2  4  8  12  17  25  41  69  114  180  280  440  705  1137  1825  2905  4610

The generating function is

    \sum_{n=0}^{\infty} y(n) x^n = \frac{1 + x + 2x^2 + 4x^3 + 3x^4 + 2x^5}{1 - x - x^4 - x^5 - 2x^6}      (14.10-5)


14.10.3 Neither substrings 1x1 nor substrings 1xy1

Figure 14.10-C: A Gray code for the length-10 binary strings with no substring 1x1 or 1xy1.


A recursion for a Gray code of the n-bit binary words Z(n) with no substrings 1x1 or 1xy1 (shown in
figure 14.10-C) is

            [ 1000 . Z(n-4)    ]
    Z(n) =  [ 11000 . Z^R(n-5) ]                                                     (14.10-6)
            [ 0 . Z(n-1)       ]

The sequence of the numbers z(n) of length-n strings starts as

n:     0  1  2  3  4   5   6   7   8   9  10   11   12   13   14   15    16    17
z(n):  1  2  4  6  8  11  17  27  41  60  88  132  200  301  449  669  1001  1502

The sequence is (apart from three leading ones) entry A079972 in [312] where two combinatorial
interpretations are given:

  Number of permutations satisfying -k<=p(i)-i<=r and p(i)-i not in I, i=1..n,
  with k=1, r=4, I={1,2}.
  Number of compositions (ordered partitions) of n into elements of the set {1,4,5}.

The generating function is

    \sum_{n=0}^{\infty} z(n) x^n = \frac{1 + x + 2x^2 + 2x^3 + x^4}{1 - x - x^4 - x^5}      (14.10-7)
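The connection to compositions can be made concrete: appending three zeros to a valid word of length n lets it split uniquely into blocks 0, 1000 and 11000, so z(n) equals the number of compositions of n+3 into parts 1, 4 and 5 (this accounts for the three leading ones mentioned above). The sketch below checks this numerically; both function names are made up for the illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Count the length-n binary words with neither a substring 1x1 nor 1xy1
// by testing all 2^n words.
ulong count_no1x1_no1xy1_brute(ulong n)
{
    ulong ct = 0;
    for (ulong w = 0; w < (1UL << n); ++w)
        if ( (w & (w >> 2)) == 0 && (w & (w >> 3)) == 0 )  ++ct;
    return ct;
}

// Number of compositions (ordered partitions) of m into parts from {1, 4, 5}.
ulong compositions_145(ulong m)
{
    std::vector<ulong> c(m + 1, 0);
    c[0] = 1;  // the empty composition
    for (ulong i = 1; i <= m; ++i)
    {
        c[i] = c[i-1];                  // last part is 1
        if ( i >= 4 )  c[i] += c[i-4];  // last part is 4
        if ( i >= 5 )  c[i] += c[i-5];  // last part is 5
    }
    return c[m];
}
```

For all n in the table the value count_no1x1_no1xy1_brute(n) agrees with compositions_145(n+3).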
Chapter 15


Parentheses strings 


We give algorithms to list all well-formed strings of n pairs of parentheses. In the spirit of [211] we use 
the term paren string for a well-formed string of parentheses. A generalization, the k-ary Dyck words, is 
described at the end of the section. 


If the problem at hand appears to be somewhat esoteric, then see [..., vol.2, exercise 6.19, p.219] for
many kinds of objects isomorphic to our paren strings. Indeed, as of May 2010, 180 kinds of combinatorial
objects counted by the Catalan numbers (which may be called Catalan objects) have been identified, see
[321] and also [320].

15.1  Co-lexicographic order 


  1:  ((((()))))  11111.....       22:  (())(()())  11..11.1..
  2:  (((()())))  1111.1....       23:  ()()(()())  1.1.11.1..
  3:  ((()(())))  111.11....       24:  ((()))(())  111...11..
  4:  (()((())))  11.111....       25:  (()())(())  11.1..11..
  5:  ()(((())))  1.1111....       26:  ()(())(())  1.11..11..
  6:  (((())()))  1111..1...       27:  (())()(())  11..1.11..
  7:  ((()()()))  111.1.1...       28:  ()()()(())  1.1.1.11..
  8:  (()(()()))  11.11.1...       29:  (((())))()  1111....1.
  9:  ()((()()))  1.111.1...       30:  ((()()))()  111.1...1.
 10:  ((())(()))  111..11...       31:  (()(()))()  11.11...1.
 11:  (()()(()))  11.1.11...       32:  ()((()))()  1.111...1.
 12:  ()(()(()))  1.11.11...       33:  ((())())()  111..1..1.
 13:  (())((()))  11..111...       34:  (()()())()  11.1.1..1.
 14:  ()()((()))  1.1.111...       35:  ()(()())()  1.11.1..1.
 15:  (((()))())  1111...1..       36:  (())(())()  11..11..1.
 16:  ((()())())  111.1..1..       37:  ()()(())()  1.1.11..1.
 17:  (()(())())  11.11..1..       38:  ((()))()()  111...1.1.
 18:  ()((())())  1.111..1..       39:  (()())()()  11.1..1.1.
 19:  ((())()())  111..1.1..       40:  ()(())()()  1.11..1.1.
 20:  (()()()())  11.1.1.1..       41:  (())()()()  11..1.1.1.
 21:  ()(()()())  1.11.1.1..       42:  ()()()()()  1.1.1.1.1.


Figure 15.1-A: All (42) valid strings of 5 pairs of parentheses in co-lexicographic order. 


An iterative scheme to generate all valid ways to group parentheses can be derived from a modified
version of the combinations in co-lexicographic order (see section 6.2.2 on page 178). For n = 5 pairs the
possible combinations are shown in figure 15.1-A. This is the output of [FXT: comb/paren-demo.cc].


Consider the sequences to the right of the paren strings as binary words (these are often called (binary) 
Dyck words). If the leftmost block has more than a single one, then its rightmost one is moved one 
position to the right. Otherwise (the leftmost block consists of a single one and) the ones of the longest 
run of the repeated pattern ‘1.’ at the left are gathered at the left end and the rightmost one in the next 
block of ones (which contains at least two ones) is moved by one position to the right and the rest of the 
block is gathered at the left end (see the transitions from #14 to #15 or #37 to #38). 


The generator is [FXT: class paren in comb/paren.h]:

class paren
{
public:
    ulong k_;    // Number of paren pairs
    ulong n_;    // ==2*k
    ulong *x_;   // Positions where an opening paren occurs
    char *str_;  // String representation, e.g. "((())())()"

public:
    paren(ulong k)
    {
        k_ = (k>1 ? k : 2);  // not zero (empty) or one (trivial: "()")
        n_ = 2 * k_;
        x_ = new ulong[k_ + 1];
        x_[k_] = 999;  // sentinel
        str_ = new char[n_ + 1];
        str_[n_] = 0;
        first();
    }

    ~paren()
    {
        delete [] x_;
        delete [] str_;
    }

    void first()  { for (ulong i=0; i<k_; ++i)  x_[i] = i; }
    void last()   { for (ulong i=0; i<k_; ++i)  x_[i] = 2*i; }
[--snip--]


The code for the computation of the successor and predecessor is quite concise. A sentinel x[k] is used 
to save one branch in the generation of the next string 


    ulong next()  // return zero if current paren is the last
    {
        // if ( k_==1 )  return 0;  // uncomment to make algorithm work for k_==1

        ulong j = 0;
        if ( x_[1] == 2 )
        {
            // scan for low end == 010101:
            while ( x_[j]==2*j )  ++j;  // can touch sentinel
            if ( j==k_ )  { first();  return 0; }
        }

        // scan block:
        while ( 1 == (x_[j+1] - x_[j]) )  { ++j; }

        ++x_[j];  // move edge element up
        for (ulong i=0; i<j; ++i)  x_[i] = i;  // attach block at low end
        return 1;
    }

    ulong prev()  // return zero if current paren is the first
    {
        // if ( k_==1 )  return 0;  // uncomment to make algorithm work for k_==1

        ulong j = 0;
        // scan for first gap:
        while ( x_[j]==j )  ++j;
        if ( j==k_ )  { last();  return 0; }

        if ( x_[j]-x_[j-1] == 2 )  --x_[j];  // gap of length one
        else
        {
            ulong i = --x_[j];
            --j;
            --i;
            // items 0..j to go, distribute as 1.1.1.11111
            for ( ; 2*j>i; --i,--j)  x_[j] = i;
            for ( ; j; --j)  x_[j] = 2*j;
            x_[0] = 0;
        }
        return 1;
    }

    const ulong * data()  const  { return x_; }
[--snip--]

The strings are set up on demand only:

    const char * string()  // generate on demand
    {
        for (ulong j=0; j<n_; ++j)  str_[j] = ')';
        for (ulong j=0; j<k_; ++j)  str_[x_[j]] = '(';
        return str_;
    }
};


The 477,638,700 paren words for n = 18 are generated at a rate of about 67 million objects per second.
Section 1.28 on page 78 gives a bit-level algorithm for the generation of the paren words in colex order.
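The successor computation can be tried without the class. The routine below is a sketch that restates the update of next() on std::vector and collects all strings; the names paren_colex() and is_balanced() are invented here, and the sentinel is the largest unsigned value instead of 999. As with the class, at least two pairs are assumed.

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <vector>

typedef unsigned long ulong;

// All well-formed strings of k pairs of parentheses (k >= 2) in colex order.
std::vector<std::string> paren_colex(ulong k)
{
    std::vector<ulong> x(k + 1);  // positions of the opening parens
    for (ulong i = 0; i < k; ++i)  x[i] = i;
    x[k] = std::numeric_limits<ulong>::max();  // sentinel (odd, never equal to 2*j)

    std::vector<std::string> all;
    while ( true )
    {
        std::string s(2 * k, ')');
        for (ulong j = 0; j < k; ++j)  s[x[j]] = '(';
        all.push_back(s);

        ulong j = 0;
        if ( x[1] == 2 )
        {
            while ( x[j] == 2 * j )  ++j;  // scan the prefix 1.1.1. (can touch sentinel)
            if ( j == k )  break;          // current string is the last
        }
        while ( x[j+1] - x[j] == 1 )  ++j;       // scan block
        ++x[j];                                  // move edge element up
        for (ulong i = 0; i < j; ++i)  x[i] = i; // attach block at low end
    }
    return all;
}

// Whether s is a well-formed paren string.
bool is_balanced(const std::string &s)
{
    long depth = 0;
    for (char c : s)
    {
        depth += ( c == '(' ? +1 : -1 );
        if ( depth < 0 )  return false;
    }
    return depth == 0;
}
```

For k = 5 the routine returns the 42 strings of figure 15.1-A in the same order, for example the entries with indices 13 and 14 are the strings #14 and #15 of the figure.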


15.2 Gray code via restricted growth strings 


  1:  [ 0, 0, 0, 0, ]  ()()()()  1.1.1.1.
  2:  [ 0, 0, 0, 1, ]  ()()(())  1.1.11..
  3:  [ 0, 0, 1, 0, ]  ()(())()  1.11..1.
  4:  [ 0, 0, 1, 1, ]  ()(()())  1.11.1..
  5:  [ 0, 0, 1, 2, ]  ()((()))  1.111...
  6:  [ 0, 1, 0, 0, ]  (())()()  11..1.1.
  7:  [ 0, 1, 0, 1, ]  (())(())  11..11..
  8:  [ 0, 1, 1, 0, ]  (()())()  11.1..1.
  9:  [ 0, 1, 1, 1, ]  (()()())  11.1.1..
 10:  [ 0, 1, 1, 2, ]  (()(()))  11.11...
 11:  [ 0, 1, 2, 0, ]  ((()))()  111...1.
 12:  [ 0, 1, 2, 1, ]  ((())())  111..1..
 13:  [ 0, 1, 2, 2, ]  ((()()))  111.1...
 14:  [ 0, 1, 2, 3, ]  (((())))  1111....


Figure 15.2-A: Length-4 restricted growth strings in lexicographic order (left) and the corresponding 
paren strings (middle) and delta sets (right). 


The valid paren strings can be represented by sequences a_0, a_1, ..., a_{n-1} where a_0 = 0 and
a_k <= a_{k-1} + 1. These sequences are examples of restricted growth strings (RGS). Some sources use
the term restricted growth functions.


The RGSs for n = 4 are shown in figure 15.2-A (left). The successor of an RGS is computed by
incrementing the highest (rightmost in figure 15.2-A) digit a_j where a_j <= a_{j-1} and setting a_i = 0 for all
i > j. The predecessor is computed by decrementing the highest digit a_j != 0 and setting a_i = a_{i-1} + 1
for all i > j.
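The successor rule and the correspondence with paren strings can be sketched as follows. The code is an illustration only and does not use the FXT class: the names catalan_rgs_lex() and rgs_to_paren() are made up, and the conversion assumes that the digit a_k is the nesting depth at which the k-th opening parenthesis appears, which is consistent with figure 15.2-A.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// All Catalan RGS of length n (n >= 1) in lexicographic order.
std::vector< std::vector<int> > catalan_rgs_lex(std::size_t n)
{
    std::vector< std::vector<int> > all;
    if ( n == 0 )  return all;
    std::vector<int> a(n, 0);
    while ( true )
    {
        all.push_back(a);
        // find highest digit that can be incremented: a[j] <= a[j-1]
        std::size_t j = n - 1;
        while ( j > 0 && a[j] > a[j-1] )  --j;
        if ( j == 0 )  break;  // current RGS is [0,1,2,...,n-1], the last
        ++a[j];
        for (std::size_t i = j + 1; i < n; ++i)  a[i] = 0;
    }
    return all;
}

// Paren string for an RGS: digit a[k] is taken as the depth of the k-th opening paren.
std::string rgs_to_paren(const std::vector<int> &a)
{
    std::string s;
    int open = 0;  // number of currently open parens
    for (std::size_t k = 0; k < a.size(); ++k)
    {
        for ( ; open > a[k]; --open)  s += ')';  // close down to depth a[k]
        s += '(';
        ++open;
    }
    s.append(open, ')');  // close what is left open
    return s;
}
```

For n = 4 this gives the 14 rows of figure 15.2-A, for example the RGS [0,1,1,2] (row 10) corresponds to the string (()(())).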


The RGSs for a given n can be generated as follows [FXT: class catalan in comb/catalan.h]:

class catalan
// Catalan restricted growth strings (RGS)
// By default in near-perfect minimal-change order, i.e.
// exactly two symbols in paren string change with each step
{
public:
    int *as_;    // digits of the RGS: as[k] <= as[k-1] + 1
    int *d_;     // direction with recursion (+1 or -1)
    ulong n_;    // Number of digits (paren pairs)
    char *str_;  // paren string
    bool xdr_;   // whether to change direction in recursion (==> minimal-change order)
    int dr0_;    // dr0: starting direction in each recursive step:
    //   dr0=+1 ==> start with as[]=[0,0,0,...,0] == "()()()...()"
    //   dr0=-1 ==> start with as[]=[0,1,2,...,n-1] == "((( ... )))"




Figure 15.2-B: Minimal-change order for the paren strings of 5 pairs. From left to right: restricted
growth strings, arrays of directions, paren strings, delta sets, and difference strings. If the changes are not
adjacent, then the distance of changed positions is given at the right. The order corresponds to dr0 [...]
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=0; k<n; ++k) 
0; k<n; ++k) 


21.03 


=0) ? +1 
for (ulong k 
for (ulong k 


( (dro> 
xdr; 


if ( dr0_>0 ) 


ulong n = n_; 
else 


delete [] as_; 
delete [] d_; 
delete [] str_; 


void init(bool xdr, int drO) 
drO 
xdr 
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15.2: Gray code via restricted growth strings 


Figure 15.2-C: Minimal-change order for the paren strings of 5 pairs. From left to right: restricted growth strings, arrays of directions, paren strings, delta sets, and difference strings. If the changes are not adjacent, then the distance of changed positions is given at the right. The order corresponds to dr0=+1.

    const int *get()  const  { return as_; }

    const char* str()  { make_str();  return (const char*)str_; }
The minimal-change order is obtained by changing the 'direction' in the recursion; an essentially identical mechanism (for the generation of set partitions) is shown in chapter 17 on page 354. The function is given in [FXT: comb/catalan.cc]:


    bool
    catalan::next_rec(ulong k)
    {
        if ( k<1 )  return false;  // current is last

        int d = d_[k];
        int as = as_[k] + d;
        bool ovq = ( (d>0) ? (as>as_[k-1]+1) : (as<0) );
        if ( ovq )  // have to recurse
        {
            ulong ns1 = next_rec(k-1);
            if ( 0==ns1 )  return false;

            d = ( xdr_ ? -d : dr0_ );
            d_[k] = d;

            as = ( (d>0) ? 0 : as_[k-1]+1 );
        }
        as_[k] = as;

        return true;
    }


The program [FXT: comb/catalan-demo.cc] demonstrates the usage:

    ulong n = 4;
    bool xdr = true;
    int dr0 = -1;
    catalan C(n, xdr, dr0);
    do { /* visit string */ } while ( C.next() );

About 69 million strings per second are generated. Figure 15.2-B shows the minimal-change order for n = 5 and dr0=-1, and figure 15.2-C for dr0=+1.


More minimal-change orders 
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Figure 15.2-D: Strings of 5 pairs of parentheses in a Gray code order. 


The Gray code order shown in figure 15.2-D can be generated via a simple recursion:

    ulong n;    // Number of paren pairs
    ulong *rv;  // restricted growth strings

    void next_rec(ulong d, bool z)
    {
        if ( d==n )  visit();
        else
        {
            const long rv1 = rv[d-1];  // left neighbor
            if ( 0==z )
            {
                for (long x=0; x<=rv1+1; ++x)  // forward
                {
                    rv[d] = x;
                    next_rec(d+1, (x&1));
                }
            }
            else
            {
                for (long x=rv1+1; x>=0; --x)  // backward
                {
                    rv[d] = x;
                    next_rec(d+1, !(x&1));
                }
            }
        }
    }


The initial call is next_rec(0, 0). About 81 million strings per second are generated [FXT: comb/paren-gray-rec-demo.cc].
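The recursion can be tried in isolation with the following minimal sketch. It is not the FXT program: the names paren_gray_rec, paren_gray_all and rgs_to_paren are hypothetical, the left neighbor is passed as a parameter (with -1 for the first digit, an assumption that replaces the access to rv[d-1]), and the strings are collected instead of being visited.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch only: collect the restricted growth strings (RGS) for n paren pairs
// in the order of the recursion.  'left' is the value of the left neighbor;
// the initial call passes left = -1 so that the first digit can only be 0.
static void paren_gray_rec(std::vector<long> &rv, std::size_t d, long left, bool z,
                           std::vector< std::vector<long> > &out)
{
    if ( d==rv.size() )  { out.push_back(rv);  return; }
    if ( !z )
    {
        for (long x=0; x<=left+1; ++x)  // forward
        {
            rv[d] = x;
            paren_gray_rec(rv, d+1, x, (x&1)!=0, out);
        }
    }
    else
    {
        for (long x=left+1; x>=0; --x)  // backward
        {
            rv[d] = x;
            paren_gray_rec(rv, d+1, x, (x&1)==0, out);
        }
    }
}

// All RGS for n pairs, in generation order (hypothetical helper).
static std::vector< std::vector<long> > paren_gray_all(std::size_t n)
{
    assert( n>=1 );
    std::vector<long> rv(n, 0);
    std::vector< std::vector<long> > out;
    paren_gray_rec(rv, 0, -1, false, out);
    return out;
}

// Paren string of an RGS a[]: the k-th opening paren is at position 2*k - a[k].
static std::string rgs_to_paren(const std::vector<long> &a)
{
    std::string s(2*a.size(), ')');
    for (std::size_t k=0; k<a.size(); ++k)  s[2*k - a[k]] = '(';
    return s;
}
```

Calling paren_gray_all(5) and mapping each element with rgs_to_paren() gives the paren strings for five pairs in the order of the recursion.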


   1:  ()()()()()  1.1.1.1.1.      22:  (()()(()))  11.1.11...
   2:  ()()()(())  1.1.1.11..      23:  (()()())()  11.1.1..1.
   3:  ()()(()())  1.1.11.1..      24:  ((())())()  111..1..1.
   4:  ()()((()))  1.1.111...      25:  ((())(()))  111..11...
   5:  ()()(())()  1.1.11..1.      26:  ((())()())  111..1.1..
   6:  ()(()())()  1.11.1..1.      27:  ((()())())  111.1..1..
   7:  ()(()(()))  1.11.11...      28:  ((()()()))  111.1.1...
   8:  ()(()()())  1.11.1.1..      29:  ((()(())))  111.11....
   9:  ()((())())  1.111..1..      30:  ((()()))()  111.1...1.
  10:  ()((()()))  1.111.1...      31:  (((())))()  1111....1.
  11:  ()(((())))  1.1111....      32:  ((((()))))  11111.....
  12:  ()((()))()  1.111...1.      33:  (((()())))  1111.1....
  13:  ()(())()()  1.11..1.1.      34:  (((())()))  1111..1...
  14:  ()(())(())  1.11..11..      35:  (((()))())  1111...1..
  15:  (()())(())  11.1..11..      36:  ((()))(())  111...11..
  16:  (()())()()  11.1..1.1.      37:  ((()))()()  111...1.1.
  17:  (()(()))()  11.11...1.      38:  (())()()()  11..1.1.1.
  18:  (()((())))  11.111....      39:  (())()(())  11..1.11..
  19:  (()(()()))  11.11.1...      40:  (())(()())  11..11.1..
  20:  (()(())())  11.11..1..      41:  (())((()))  11..111...
  21:  (()()()())  11.1.1.1..      42:  (())(())()  11..11..1.


Figure 15.2-E: Strings of 5 pairs of parentheses in Gray code order as generated by a loopless algorithm. 


A loopless algorithm (that does not use RGS) given in [329] is implemented in [FXT: class paren_gray in comb/paren-gray.h]. The generated order for five paren pairs is shown in figure 15.2-E. About 80 million strings per second are generated [FXT: comb/paren-gray-demo.cc]. Still more algorithms for the parentheses strings in minimal-change order are given in [90], [337], and [363].

   0:  ....1111  ==  (((())))
   1:  ...1.111  ==  ((()()))  ^=  ...11...
   2:  ...11.11  ==  (()(()))  ^=  ....11..
   3:  ...111.1  ==  ()((()))  ^=  .....11.
   4:  ..1.11.1  ==  ()(()())  ^=  ..11....
   5:  ..1.1.11  ==  (()()())  ^=  .....11.
   6:  ..1..111  ==  ((())())  ^=  ....11..
   7:  .1...111  ==  ((()))()  ^=  .11.....
   8:  .1..1.11  ==  (()())()  ^=  ....11..
   9:  .1..11.1  ==  ()(())()  ^=  .....11.
  10:  .1.1.1.1  ==  ()()()()  ^=  ...11...
  11:  ..11.1.1  ==  ()()(())  ^=  .11.....
  12:  ..11..11  ==  (())(())  ^=  .....11.
  13:  .1.1..11  ==  (())()()  ^=  .11.....


Figure 15.2-F: A strong minimal-change order for the paren strings of 4 pairs. 


For even values of n it is possible to generate paren strings in strong minimal-change order where changes occur only in adjacent positions. Figure 15.2-F shows an example for four pairs of parens. The listing was generated with [FXT: ] that uses directed graphs and the search algorithms described in chapter

   1:  ((((()))))  11111.....      22:  ((())()())  111..1.1..
   2:  ()(((())))  1.1111....      23:  ()(())(())  1.11..11..
   3:  (()((())))  11.111....      24:  (()())(())  11.1..11..
   4:  ((()(())))  111.11....      25:  ()()()(())  1.1.1.11..
   5:  (((()())))  1111.1....      26:  (())()(())  11..1.11..
   6:  ()((()()))  1.111.1...      27:  ((()))(())  111...11..
   7:  (()(()()))  11.11.1...      28:  (((()))())  1111...1..
   8:  ((()()()))  111.1.1...      29:  ()((()))()  1.111...1.
   9:  ()(()(()))  1.11.11...      30:  (()(()))()  11.11...1.
  10:  (()()(()))  11.1.11...      31:  ((()()))()  111.1...1.
  11:  ()()((()))  1.1.111...      32:  ()(()())()  1.11.1..1.
  12:  (())((()))  11..111...      33:  (()()())()  11.1.1..1.
  13:  ((())(()))  111..11...      34:  ()()(())()  1.1.11..1.
  14:  (((())()))  1111..1...      35:  (())(())()  11..11..1.
  15:  ()((())())  1.111..1..      36:  ((())())()  111..1..1.
  16:  (()(())())  11.11..1..      37:  ()(())()()  1.11..1.1.
  17:  ((()())())  111.1..1..      38:  (()())()()  11.1..1.1.
  18:  ()(()()())  1.11.1.1..      39:  ()()()()()  1.1.1.1.1.
  19:  (()()()())  11.1.1.1..      40:  (())()()()  11..1.1.1.
  20:  ()()(()())  1.1.11.1..      41:  ((()))()()  111...1.1.
  21:  (())(()())  11..11.1..      42:  (((())))()  1111....1.
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Figure 15.3-A: All strings of 5 pairs of parentheses generated via prefix shifts. 


15.3 Order by prefix shifts (cool-lex) 


The binary words corresponding to paren strings can be generated in an order where each word differs 
from its successor by a cyclic shift of a prefix (ignoring the first bit which is always one). Moreover, each 
transition changes either two or four bits, see figure 15.3-A.


The (loopless) algorithm described in [292] can generate slightly more general objects: strings of t ones 
and s zeros where the number of zeros in any prefix does not exceed the number of ones. Paren strings 


correspond to t = s. The generator is implemented as follows [FXT: comb/paren-pref.h]:


    class paren_pref
    {
    public:
        const ulong t_, s_;  // t: number of ones, s: number of zeros
        const ulong nq_;     // aux
        ulong x_, y_;        // aux
        ulong *b_;           // array of t ones and s zeros

    public:
        paren_pref(ulong t, ulong s)
        // Must have: t >= s > 0
            : t_(t), s_(s), nq_(s+t-(s==t))
        {
            b_ = new ulong[s_+t_+1];  // element [0] unused
            first();
        }

        ~paren_pref()  { delete [] b_; }

        const ulong * data()  const  { return b_+1; }

        void first()
        {
            for (ulong j=0; j<=t_; ++j)  b_[j] = 1;
            for (ulong j=t_+1; j<=s_+t_; ++j)  b_[j] = 0;
            x_ = y_ = t_;
        }


The method for updating is

        bool next()
        {
            if ( x_ >= nq_ )  return false;
            b_[x_] = 0;
            b_[y_] = 1;
            ++x_;
            ++y_;
            if ( b_[x_] == 0 )
            {
                if ( x_ == 2*y_ - 2 )  ++x_;
                else
                {
                    b_[x_] = 1;
                    b_[2] = 0;
                    x_ = 3;
                    y_ = 2;
                }
            }
            return true;
        }


Note that the array b[] is one-based, as in the cited paper. A zero-based version is used if the line

    #define PAREN_PREF_BASE1  // default on (faster)

near the top of the file is commented out. The rate of generation (with t = s = 18) is impressive: about 268 M/s when using a pointer and about 281 M/s when using an array [FXT: comb/paren-pref-demo.cc].
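As an illustration of the prefix-shift update, here is a self-contained sketch that lists all words for given t and s. It follows the update rule as read from the listing above, uses a one-based array, and returns the words as strings of '1' and '.'; the function name paren_pref_all and the string output are assumptions made for this example, not part of FXT.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: all words with t ones and s zeros (t >= s > 0) in which no prefix
// has more zeros than ones, in the order obtained by prefix shifts.
static std::vector<std::string> paren_pref_all(unsigned long t, unsigned long s)
{
    assert( (t>=s) && (s>0) );
    const unsigned long nq = s + t - (s==t ? 1 : 0);
    std::vector<unsigned long> b(s+t+2, 0);  // b[0] unused, one spare slot at the end
    for (unsigned long j=1; j<=t; ++j)  b[j] = 1;
    unsigned long x = t, y = t;

    std::vector<std::string> out;
    while ( true )
    {
        std::string w;
        for (unsigned long j=1; j<=s+t; ++j)  w += ( b[j] ? '1' : '.' );
        out.push_back(w);

        if ( x >= nq )  break;  // current word is the last
        b[x] = 0;
        b[y] = 1;
        ++x;
        ++y;
        if ( b[x] == 0 )
        {
            if ( x == 2*y - 2 )  ++x;
            else  { b[x] = 1;  b[2] = 0;  x = 3;  y = 2; }
        }
    }
    return out;
}
```

With t = s = 5 the first words are 11111....., 1.1111...., and 11.111...., in agreement with the start of figure 15.3-A.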
15.4 Catalan numbers 


The number of valid combinations of n parentheses pairs is

    C_n = \frac{\binom{2n}{n}}{n+1} = \frac{\binom{2n+1}{n}}{2n+1} = \frac{\binom{2n}{n-1}}{n} = \binom{2n}{n} - \binom{2n}{n-1}    (15.4-1)

as nicely explained in [p.343-346]. These are the Catalan numbers, sequence A000108 in [312]:

   n:  C_n        n:  C_n              n:  C_n
   1:  1         11:  58786           21:  24466267020
   2:  2         12:  208012          22:  91482563640
   3:  5         13:  742900          23:  343059613650
   4:  14        14:  2674440         24:  1289904147324
   5:  42        15:  9694845         25:  4861946401452
   6:  132       16:  35357670        26:  18367353072152
   7:  429       17:  129644790       27:  69533550916004
   8:  1430      18:  477638700       28:  263747951750360
   9:  4862      19:  1767263190      29:  1002242216651368
  10:  16796     20:  6564120420      30:  3814986502092304


The Catalan numbers are generated most easily with the relation

    C_{n+1} = \frac{2\,(2n+1)}{n+2}\, C_n    (15.4-2)

The generating function is

    C(x) = \frac{1-\sqrt{1-4x}}{2x} = \sum_{n=0}^{\infty} C_n x^n = 1 + x + 2x^2 + 5x^3 + 14x^4 + 42x^5 + \ldots    (15.4-3)


The function C(x) satisfies the equation [x C(x)] = x + [x C(x)]^2 which is equivalent to the following convolution property for the Catalan numbers:

    C_n = \sum_{k=0}^{n-1} C_k\, C_{n-1-k}    (15.4-4)

The quadratic equation has a second solution (1 + \sqrt{1-4x})/(2x) = x^{-1} - 1 - x - 2x^2 - 5x^3 - 14x^4 - \ldots which we ignore here.
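Relations 15.4-2 and 15.4-4 are easy to check by machine. The following is a minimal sketch under the assumption that 64-bit unsigned integers are used, which limits n to the range of the table above; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Catalan numbers C_0, ..., C_nmax via C_{n+1} = 2*(2n+1)/(n+2) * C_n.
// The multiplication is done first, the division is then exact.
static std::vector<std::uint64_t> catalan_numbers(unsigned nmax)
{
    assert( nmax<=30 );  // keeps all intermediate products below 2^64
    std::vector<std::uint64_t> c(nmax+1);
    c[0] = 1;
    for (unsigned n=0; n<nmax; ++n)
        c[n+1] = c[n] * ( 2*(2*std::uint64_t(n)+1) ) / (n+2);
    return c;
}

// Check the convolution property C_n = sum_{k=0}^{n-1} C_k * C_{n-1-k}.
static bool catalan_convolution_ok(const std::vector<std::uint64_t> &c)
{
    for (std::size_t n=1; n<c.size(); ++n)
    {
        std::uint64_t s = 0;
        for (std::size_t k=0; k<n; ++k)  s += c[k] * c[n-1-k];
        if ( s != c[n] )  return false;
    }
    return true;
}
```

For larger n a multi-precision type would be needed; the recurrence itself is unchanged.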
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Figure 15.5-A: The 55 increment-2 restricted growth strings of length 4 (left), the corresponding 3-ary 


Dyck words (middle), and positions of ones in the Dyck words (right).

15.5 Increment-i RGS, k-ary Dyck words, and k-ary trees

We generalize the restricted growth strings for paren words by allowing increments of at most i: sequences a_0, a_1, ..., a_{n-1} where a_0 = 0 and a_k <= a_{k-1} + i. The case i = 1 corresponds to the RGS for paren words.

A k-ary Dyck word is a binary word where in each prefix the number of zeros is at most k - 1 times the number of ones. The increment-i RGS correspond to k-ary Dyck words where k = i + 1, see figure 15.5-A. The positions of the ones in the Dyck words are computed as c_j = k * j - a_j (rightmost column).

The length-n increment-i RGS also correspond to k-ary trees with n internal nodes: start at the root, move out by i positions for every one and follow back by one position for every zero.
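The correspondence between RGS and Dyck words can be sketched as follows. The code assumes zero-based positions c_j = k*j - a_j and writes zeros as dots; the names rgs_to_dyck and is_kary_dyck are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Dyck word of an increment-i RGS a[0..n-1], where k = i+1:
// the j-th one is put at position k*j - a[j] (zero-based, an assumption).
static std::string rgs_to_dyck(const std::vector<unsigned long> &a, unsigned long i)
{
    const unsigned long k = i + 1;
    std::string w(k*a.size(), '.');
    for (std::size_t j=0; j<a.size(); ++j)
    {
        assert( a[j] <= k*j );
        w[k*j - a[j]] = '1';
    }
    return w;
}

// True if every prefix has at most (k-1) times as many zeros as ones,
// and the whole word has exactly (k-1) times as many zeros as ones.
static bool is_kary_dyck(const std::string &w, unsigned long k)
{
    unsigned long ones = 0, zeros = 0;
    for (char ch : w)
    {
        if ( ch=='1' )  ++ones;  else  ++zeros;
        if ( zeros > (k-1)*ones )  return false;
    }
    return ( zeros == (k-1)*ones );
}
```

For example, with i = 2 the RGS [0,0,0,0] gives 1..1..1..1.. and the RGS [0,2,4,6] gives 1111........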


15.5.1 Generation in lexicographic order 


Figure 15.5-A shows the increment-2 restricted growth strings of length 4. The strings can be generated in lexicographic order via [FXT: class dyck_rgs in comb/dyck-rgs.h]:


    class dyck_rgs
    {
    public:
        ulong *s_;  // restricted growth string
        ulong n_;   // Length of strings
        ulong i_;   // s[k] <= s[k-1]+i
        [--snip--]

        ulong next()
        // Return index of first changed element in s[],
        // Return zero if current string is the last
        {
            ulong k = n_;

        start:
            --k;
            if ( k==0 )  return 0;

            ulong sk = s_[k] + 1;
            ulong mp = s_[k-1] + i_;
            if ( sk > mp )  // "carry"
            {
                s_[k] = 0;
                goto start;
            }

            s_[k] = sk;
            return k;
        }
        [--snip--]
The rate of generation is about 168 M/s for i = 1, 194 M/s for i = 2, and 218 M/s with i = 3 [FXT: comb/dyck-rgs-demo.cc].
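The successor computation can be exercised with a small stand-alone sketch that counts the strings. It uses a std::vector instead of the raw array of the class, and the names dyck_rgs_next and count_dyck_rgs are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Successor in lexicographic order for increment-i RGS; s[0] stays zero.
// Returns the index of the changed digit, or 0 if s was the last string
// (in that case all digits have been reset to zero).
static std::size_t dyck_rgs_next(std::vector<unsigned long> &s, unsigned long i)
{
    assert( !s.empty() );
    for (std::size_t k=s.size()-1; k!=0; --k)
    {
        if ( s[k] + 1 <= s[k-1] + i )  { ++s[k];  return k; }
        s[k] = 0;  // "carry"
    }
    return 0;
}

// Number of increment-i RGS of length n, found by stepping through all of them.
static unsigned long count_dyck_rgs(std::size_t n, unsigned long i)
{
    std::vector<unsigned long> s(n, 0);
    unsigned long ct = 1;  // the all-zero string
    while ( dyck_rgs_next(s, i) )  ++ct;
    return ct;
}
```

The call count_dyck_rgs(4, 2) gives 55, the number of strings shown in figure 15.5-A.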


15.5.2 Gray codes with homogeneous moves 


A loopless algorithm for the generation of a Gray code with only homogeneous moves is given in [37]. The RGS used in the algorithm gives the positions (one-based) of the ones in the delta sets, see figure 15.5-B (created with [FXT: ]). An implementation is given in [FXT: class dyck_gray in ].

A Gray code where in addition all transitions are two-close is shown in figure 15.5-C (created with [FXT: comb/dyck-gray2-demo.cc]). Note that the moves are enup-moves, compare to figure 6.6-B on page 189. The underlying algorithm is described in [338], an implementation is given in [FXT: class dyck_gray2 in comb/dyck-gray2.h]:


    class dyck_gray2
    {
    public:
        ulong m, k;  // m ones (and m*(k-1) zeros)
        bool ptt;    // Parity of Total number of Tories (variable 'Odd' in paper)
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Figure 15.5-B: Gray code for 3-ary Dyck words where all changes are homogeneous. The left column 


shows the vectors of (one-based) positions, the symbol ‘A’ is used for the number 10. 




Figure 15.5-C: Gray code for 3-ary Dyck words where all changes are both homogeneous and two-close. The left column shows the vectors of (one-based) positions, the symbol 'A' is used for the number 10.

        ulong *c_;  // positions of ones (1-based)
        ulong *e_;  // Ehrlich array (focus pointers)
        bool *p_;   // parity (1-based)
        int *s_;    // directions: whether last/first (==0) or
                    // rising (>0) or falling (<0); (1-based)

    public:
        dyck_gray2(ulong tk, ulong tm)
        // must have tk>=2, tm>=1
        {
            k = tk;
            m = tm;
            ptt = false;
            c_ = new ulong[m+2];
            // sentinels c_[0] (with computing MN) and c_[m+1] (with condition in next())

            e_ = new ulong[m+1];
            p_ = new bool[m+1];  // p_[0] unused
            s_ = new int[m+1];   // s_[0] unused
            first();
        }

        ~dyck_gray2()
        [--snip--]

        void first()
        {
            for (ulong j=0; j<=m; ++j)  e_[j] = j;      // {e_[j] = j for 0 <= j <= m}
            for (ulong j=0; j<=m; ++j)  s_[j] = 0;      // {s_[j] = 0 for 1 <= j <= m}
            for (ulong j=0; j<=m; ++j)  p_[j] = false;  // {p_[j] = 0 for 1 <= j <= m}
            for (ulong j=0; j<=m; ++j)  c_[j] = j;      // first word -- [1, 2, 3, ..., m]
            c_[m+1] = 0;  // sentinel, c_[0] is also sentinel
        }
The following comments in curly braces are from the paper: 
        ulong next()
        // Return zero if current==last, else
        // position (!=0) in (zero-based) array c_[]
        // (the first element never changes).
        {
            ulong i = e_[m];  // The pivot
            if ( i==1 )  return 0;  // current is last

            const ulong MN = c_[i-1] + 1;  // {MN is the minimum value of c_[i]}
            // can touch sentinel c_[0]
            const ulong MX = (i - 1)*k + 1;  // {MX is the maximum value of c_[i]}

            if ( s_[i] == 0 )  // {c_[i] is at its first value}
            {
                p_[i] = ptt;  // {parity of total number of tories}
                s_[i] = +1;   // {c_[i] starts rising unless it starts at max(i)}

                if ( c_[i] == MX )  // {one of these tories is not to c_[i]'s left}
                {
                    p_[i] = 1 - p_[i];
                    s_[i] = -s_[i];
                }

                if ( c_[i+1] == MX+k )  // can touch sentinel c_[m+1]
                {
                    p_[i] = 1 - p_[i];
                }
            }

            if ( s_[i] > 0 )  // {c_[i] is rising}
            {
                if ( c_[i] == MN )  // {MN is taken and c_[i] can't end there}
                {
                    s_[i] = 2;
                }
                else
                {
                    if ( (c_[i] == MN+1) && (s_[i] == 2) )  // {MN+1 is also taken}
                    {
                        s_[i] = 3;
                    }
                }

                c_[i] += ( 1 + ( ((c_[i] % 2) == p_[i]) && (c_[i] < MX-1) ) );
                if ( c_[i] == MX )  // {one more tory}
                {
                    ptt = 1 - ptt;
                    s_[i] = -s_[i];
                }
            }
            else  // {c_[i] is falling}
            {
                if ( c_[i] == MX )  { ptt = 1 - ptt; }  // {one fewer tory}
                c_[i] -= ( 1 + ( ((c_[i] % 2) != p_[i]) && (c_[i] > MN+1) ) );
            }

            e_[m] = m;  // {beginning to update Ehrlich array}
            if ( c_[i] + s_[i] == MN-1 )  // {c_[i] is at its last value}
            {
                s_[i] = 0;  // {c_[i] will be at its first value the next time i is the pivot}
                e_[i] = e_[i-1];
                e_[i-1] = i - 1;
            }

            return i - 1;  // position in zero-based array c_[]
        }

        const ulong *data()  const  { return c_+1; }  // zero-based array
    };


15.5.3 The number of increment-i RGS


  n:  1  2   3    4     5      6       7        8         9         10          11
i=1:  1  2   5   14    42    132     429     1430      4862      16796       58786
i=2:  1  3  12   55   273   1428    7752    43263    246675    1430715     8414640
i=3:  1  4  22  140   969   7084   53820   420732   3362260   27343888   225568798
i=4:  1  5  35  285  2530  23751  231880  2330445  23950355  250543370  2658968130

Figure 15.5-D: The numbers C_{n,i} of increment-i RGS of length n for i ≤ 4 and n ≤ 11.


The number C_{n,i} of length-n increment-i strings equals

    C_{n,i} = \frac{\binom{(i+1)\,n}{n}}{i\,n+1}    (15.5-1)

A recursion generalizing relation 15.4-2 is

    C_{n+1,i} = (i+1)\, \frac{\prod_{k=1}^{i} [(i+1)\,n + k]}{\prod_{k=1}^{i} [i\,n + k + 1]}\, C_{n,i}    (15.5-2)
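The recursion 15.5-2 can be checked against the values of figure 15.5-D with a few lines of code. This is only a sketch: it assumes plain 64-bit arithmetic (sufficient for i ≤ 4 and n ≤ 11) and the function name num_incr_rgs is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// C_{n,i} computed with the recursion
//   C_{m+1,i} = (i+1) * prod_{k=1..i} [(i+1)*m + k] / prod_{k=1..i} [i*m + k + 1] * C_{m,i}
// starting with C_{1,i} = 1.
static std::uint64_t num_incr_rgs(unsigned n, unsigned i)
{
    assert( (n>=1) && (n<=11) && (i>=1) && (i<=4) );  // range where no overflow occurs
    std::uint64_t c = 1;  // C_{1,i}
    for (std::uint64_t m=1; m<n; ++m)
    {
        std::uint64_t num = i + 1,  den = 1;
        for (std::uint64_t k=1; k<=i; ++k)
        {
            num *= (i+1)*m + k;
            den *= i*m + k + 1;
        }
        c = c * num / den;  // multiply first, the division is then exact
    }
    return c;
}
```

For i = 1 the factor reduces to 2 (2n+1)/(n+2), which is relation 15.4-2.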


The sequences of numbers of length-n strings for i = 1, 2, 3, 4 start as shown in figure 15.5-D. These are respectively the entries A000108, A001764, A002293, A002294 in [312] where combinatorial interpretations are given. We can express the generating function C_i(x) as a hypergeometric series (see chapter 36 on page 685):
    C_i(x) = \sum_{n=0}^{\infty} C_{n,i}\, x^n    (15.5-3a)

           = F\left( \begin{matrix} 1/(i+1),\ 2/(i+1),\ 3/(i+1),\ \ldots,\ (i+1)/(i+1) \\ 2/i,\ 3/i,\ \ldots,\ i/i,\ (i+1)/i \end{matrix} \;\middle|\; \frac{(i+1)^{i+1}}{i^i}\, x \right)    (15.5-3b)

Note that the last upper and second last lower parameter cancel. Now let f_i(x) := x C_i(x^i), then

    f_i(x) - f_i(x)^{i+1} = x    (15.5-4)

That is, f_i(x) can be computed as the series reversion of x - x^{i+1}. We choose i = 2 as an example:

    ? t1=serreverse(x-x^3+O(x^(17)))
    x + x^3 + 3*x^5 + 12*x^7 + 55*x^9 + 273*x^11 + 1428*x^13 + 7752*x^15 + O(x^17)
    ? t2=hypergeom([1/3,2/3,3/3],[2/2,3/2],3^3/2^2*x)+O(x^17)
    1 + x + 3*x^2 + 12*x^3 + 55*x^4 + 273*x^5 + 1428*x^6 + 7752*x^7 + ... + O(x^17)
    ? f=x*subst(t2,x,x^2);
    ? t1-f
    O(x^17)  \\ f is actually the series reversion of x-x^3
    ? f-f^3
    x + O(x^35)  \\ ... so f - f^3 == id
We further have the following convolution property which generalizes relation 15.4-4:

    C_{n,i} = \sum_{j_1 + j_2 + \ldots + j_{i+1} = n-1} C_{j_1,i}\, C_{j_2,i}\, C_{j_3,i} \cdots C_{j_i,i}\, C_{j_{i+1},i}    (15.5-5)

An explicit expression for the function C_i(x) is

    C_i(x) = \exp\left( \frac{1}{i+1} \sum_{n=1}^{\infty} \binom{(i+1)\,n}{n} \frac{x^n}{n} \right)    (15.5-6)

The expression generalizes a relation given in [227, rel.6] (set i = 1 and take the logarithm on both sides)

    \sum_{n=1}^{\infty} \frac{1}{n} \binom{2n}{n} x^n = 2 \log\left( \frac{1-\sqrt{1-4x}}{2x} \right)    (15.5-7)

A curious property of the functions C_i(x) is given in [349, entry "Hypergeometric Function"]:

    C_i\left( x\, (1-x)^i \right) = \frac{1}{1-x}    (15.5-8)
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Chapter 16 


Integer partitions 


   1:  6 == 6* 1 + 0    + 0    + 0    + 0    + 0     ==  1 + 1 + 1 + 1 + 1 + 1
   2:  6 == 4* 1 + 1* 2 + 0    + 0    + 0    + 0     ==  1 + 1 + 1 + 1 + 2
   3:  6 == 2* 1 + 2* 2 + 0    + 0    + 0    + 0     ==  1 + 1 + 2 + 2
   4:  6 == 0    + 3* 2 + 0    + 0    + 0    + 0     ==  2 + 2 + 2
   5:  6 == 3* 1 + 0    + 1* 3 + 0    + 0    + 0     ==  1 + 1 + 1 + 3
   6:  6 == 1* 1 + 1* 2 + 1* 3 + 0    + 0    + 0     ==  1 + 2 + 3
   7:  6 == 0    + 0    + 2* 3 + 0    + 0    + 0     ==  3 + 3
   8:  6 == 2* 1 + 0    + 0    + 1* 4 + 0    + 0     ==  1 + 1 + 4
   9:  6 == 0    + 1* 2 + 0    + 1* 4 + 0    + 0     ==  2 + 4
  10:  6 == 1* 1 + 0    + 0    + 0    + 1* 5 + 0     ==  1 + 5
  11:  6 == 0    + 0    + 0    + 0    + 0    + 1* 6  ==  6


Figure 16.0-A: All (eleven) integer partitions of 6. 


An integer x is the sum of the positive integers less than or equal to itself in various ways. The decompositions into sums of integers are called the integer partitions of the number x. Figure 16.0-A shows all integer partitions of x = 6.


16.1 Solution of a generalized problem 


We can solve a more general problem and find all partitions of a number x with respect to a set V = {v_0, v_1, ..., v_{n-1}} where v_i > 0, that is, all decompositions of the form x = \sum_{k=0}^{n-1} c_k v_k where c_k >= 0. The integer partitions are the special case V = {1, 2, 3, ..., n}.

To generate the partitions assign to the first bucket r_0 an integer multiple of the first element v_0: r_0 = c v_0. This has to be done for all c >= 0 for which r_0 <= x. Now set c_0 = c. If r_0 = x, we already found a partition (consisting of c_0 only), else (if r_0 < x) solve the remaining problem where x' := x - c_0 v_0 and V' := {v_1, v_2, ..., v_{n-1}}.
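The bucket description translates directly into a recursive routine. The following sketch only counts the partitions and is not the iterative FXT implementation; the names count_partitions and count_integer_partitions are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Number of decompositions x = sum_k c_k * v[k] with c_k >= 0:
// choose the multiple r = c * v[k] for the current value,
// then solve the remaining problem for x - r with the values v[k+1], ...
static unsigned long count_partitions(long x, const std::vector<long> &v, std::size_t k=0)
{
    if ( x==0 )  return 1;          // found one: all remaining multipliers are zero
    if ( k==v.size() )  return 0;   // nonzero rest, but no values left
    assert( v[k] > 0 );
    unsigned long ct = 0;
    for (long r=0; r<=x; r+=v[k])
        ct += count_partitions(x - r, v, k+1);
    return ct;
}

// Integer partitions of n, the special case V = {1, 2, ..., n}.
static unsigned long count_integer_partitions(long n)
{
    std::vector<long> v;
    for (long j=1; j<=n; ++j)  v.push_back(j);
    return count_partitions(n, v);
}
```

The recursion visits every partition once, so it is only practical for small x.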


A C++ class for the generation of all partitions is [FXT: class partition_gen in comb/partition-gen.h]:

    class partition_gen
    // Integer partitions of x into supplied values pv[0],...,pv[n-1].
    // pv[] defaults to [1,2,3,...,x]
    {
    public:
        ulong ct_;   // Number of partitions found so far
        ulong n_;    // Number of values
        ulong i_;    // level in iterative search

        long *pv_;   // values into which to partition
        ulong *pc_;  // multipliers for values
        ulong pci_;  // temporary for pc_[i_]
        long *r_;    // rest
        long ri_;    // temporary for r_[i_]
        long x_;     // value to partition

        partition_gen(ulong x, ulong n=0, const ulong *pv=0)
        {
            if ( 0==n )  n = x;
            n_ = n;
            pv_ = new long[n_+1];
            if ( pv )  for (ulong j=0; j<n_; ++j)  pv_[j] = pv[j];
            else       for (ulong j=0; j<n_; ++j)  pv_[j] = j + 1;
            pc_ = new ulong[n_+1];
            r_ = new long[n_+1];
            init(x);
        }

        void init(ulong x)
        {
            x_ = x;
            ct_ = 0;
            for (ulong k=0; k<n_; ++k)  pc_[k] = 0;
            for (ulong k=0; k<n_; ++k)  r_[k] = 0;
            r_[n_-1] = x_;
            r_[n_] = x_;
            i_ = n_ - 1;
            pci_ = 0;
            ri_ = x_;
        }

        ~partition_gen()
        {
            delete [] pv_;
            delete [] pc_;
            delete [] r_;
        }

        ulong next();  // generate next partition
        ulong next_func(ulong i);  // aux
        [--snip--]
    };
The routine to compute the next partition is given in [FXT: ]:

    ulong
    partition_gen::next()
    {
        if ( i_>=n_ )  return n_;

        r_[i_] = ri_;
        pc_[i_] = pci_;

        i_ = next_func(i_);

        for (ulong j=0; j<i_; ++j)  pc_[j] = r_[j] = 0;

        ++i_;
        ri_ = r_[i_] - pv_[i_];
        pci_ = pc_[i_] + 1;

        return i_ - 1;  // >=0
    }

    ulong
    partition_gen::next_func(ulong i)
    {
     start:
        if ( 0!=i )
        {
            while ( r_[i]>0 )
            {
                pc_[i-1] = 0;
                r_[i-1] = r_[i];
                --i;  goto start;  // iteration
            }
        }
        else  // iteration end
        {
            if ( 0!=r_[i] )
            {
                long d = r_[i] / pv_[i];
                r_[i] -= d * pv_[i];
                pc_[i] = d;
            }
        }

        if ( 0==r_[i] )  // valid partition found
        {
            ++ct_;
            return i;
        }

        ++i;
        if ( i>=n_ )  return n_;  // search finished

        r_[i] -= pv_[i];
        ++pc_[i];

        goto start;  // iteration
    }


The routines can easily be adapted to the generation of partitions satisfying certain restrictions, for example, partitions into distinct parts (that is, c_i <= 1).

The listing shown in figure 16.0-A can be generated with [FXT: comb/partition-gen-demo.cc]. The 190,569,292 partitions of 100 are generated at a rate of about 18 M/s.


16.2 Iterative algorithm 


An iterative implementation for the generation of the integer partitions is given in [FXT: class partition in comb/partition.h]:


    class partition
    {
    public:
        ulong *c_;  // partition:  c[1]* 1 + c[2]* 2 + ... + c[n]* n == n
        ulong *s_;  // cumulative sums:  s[j+1] = c[1]* 1 + c[2]* 2 + ... + c[j]* j
        ulong n_;   // partitions of n

    public:
        partition(ulong n)
        {
            n_ = n;
            c_ = new ulong[n+1];
            s_ = new ulong[n+1];
            s_[0] = 0;  // unused
            c_[0] = 0;  // unused
            first();
        }

        ~partition()
        {
            delete [] c_;
            delete [] s_;
        }

        void first()
        {
            c_[1] = n_;
            for (ulong i=2; i<=n_; i++)  { c_[i] = 0; }
            s_[1] = 0;
            for (ulong i=2; i<=n_; i++)  { s_[i] = n_; }
        }

        void last()
        {
            for (ulong i=1; i<n_; i++)  { c_[i] = 0; }
            c_[n_] = 1;
            for (ulong i=1; i<=n_; i++)  { s_[i] = 0; }
            // s_[n_+1] = n_;  // unused (and out of bounds)
        }
To compute the next partition, find the smallest index i >= 2 so that [c_1, c_2, ..., c_{i-1}, c_i] can be replaced by [z, 0, 0, ..., 0, c_i + 1] where z >= 0. The index i is determined using cumulative sums. The partitions are generated in the same order as shown in figure 16.0-A. The algorithm was given (2006) by Torsten Finke [priv. comm.].
        bool next()
        {
            if ( c_[n_]!=0 )  return false;  // last == 1* n (c[n]==1)

            // Find first coefficient c[i], i>=2 that can be increased:
            ulong i = 2;
            while ( s_[i]<i )  ++i;

            ++c_[i];
            s_[i] -= i;
            ulong z = s_[i];
            // Now set c[1], c[2], ..., c[i-1] to the first partition
            // of z into i-1 parts, i.e. set to z, 0, 0, ..., 0:
            while ( --i > 1 )
            {
                s_[i] = z;
                c_[i] = 0;
            }
            c_[1] = z;  // z* 1 == z
            // s_[1] unused

            return true;
        }
The preceding partition can be computed as follows:

        bool prev()
        {
            if ( c_[1]==n_ )  return false;  // first == n* 1 (c[1]==n)

            // Find first nonzero coefficient c[i] where i>=2:
            ulong i = 2;
            while ( c_[i]==0 )  ++i;

            --c_[i];
            s_[i] += i;
            ulong z = s_[i];
            // Now set c[1], c[2], ..., c[i-1] to the last partition
            // of z into i-1 parts:
            while ( --i > 1 )
            {
                ulong q = (z>=i ? z/i : 0);  // == z/i;
                c_[i] = q;
                s_[i+1] = z;
                z -= q*i;
            }
            c_[1] = z;
            s_[2] = z;
            // s_[1] unused

            return true;
        }
        [--snip--]
    };
Divisions which result in q = 0 are avoided, leading to a small speedup. The program [FXT: comb/partition-demo.cc] demonstrates the usage of the class. About 200 million partitions per second are generated, and about 70 million for the reversed order.
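The successor computation with cumulative sums can be tried with the following stand-alone sketch. It mirrors the method described in this section but uses std::vector for storage; the names partition_sketch and count_partitions_iter are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Arrays are one-based: c[j] is the multiplicity of the part j,
// s[j+1] = c[1]*1 + c[2]*2 + ... + c[j]*j  (s[0] and s[1] are unused).
struct partition_sketch
{
    std::vector<unsigned long> c, s;
    unsigned long n;

    explicit partition_sketch(unsigned long tn)
        : c(tn+1, 0), s(tn+1, tn), n(tn)
    {
        assert( n>=1 );
        c[1] = n;  // first partition: n * 1
        s[0] = 0;
        s[1] = 0;
    }

    bool next()
    {
        if ( c[n]!=0 )  return false;  // last partition is 1 * n
        unsigned long i = 2;
        while ( s[i]<i )  ++i;  // first c[i], i>=2, that can be increased
        ++c[i];
        s[i] -= i;
        const unsigned long z = s[i];
        while ( --i > 1 )  { s[i] = z;  c[i] = 0; }  // rest goes to c[1]
        c[1] = z;
        return true;
    }
};

// Number of partitions of n, found by stepping through all of them.
static unsigned long count_partitions_iter(unsigned long n)
{
    partition_sketch p(n);
    unsigned long ct = 1;
    while ( p.next() )  ++ct;
    return ct;
}
```

For n = 6 the first call to next() changes the partition from 6*1 to 4*1 + 1*2, the second entry of figure 16.0-A.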


16.3 Partitions into m parts 
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Figure 16.3-A: The 22 partitions of 19 into 11 parts in lexicographic order. 


An algorithm for the generation of all partitions of n into m parts is given in [vol2, p.106]:

    The initial partition contains m-1 units and the element n-m+1. To obtain a new partition
    from a given one, pass over the elements of the latter from right to left, stopping at the first
    element f which is less, by at least two units, than the final element [...]. Without altering
    any element at the left of f, write f+1 in place of f and every element to the right of f with
    the exception of the final element, in whose place is written the number which when added
    to all the other new elements gives the sum n. The process to obtain partitions stops when
    we reach one in which no part is less than the final part by at least two units.


    class mpartition
    // Integer partitions of n into m parts
    {
    public:
        ulong *x_;  // partition: x[1]+x[2]+...+x[m] = n
        ulong *s_;  // aux: cumulative sums of x[]  (s[0]=0)
        ulong n_;   // integer partitions of n  (must have n>0)
        ulong m_;   // ... into m parts  (must have 0<m<=n)

        mpartition(ulong n, ulong m)
            : n_(n), m_(m)
        {
            x_ = new ulong[m_+1];
            s_ = new ulong[m_+1];
            init();
        }
        [--snip--]

        void init()
        {
            x_[0] = 0;
            for (ulong k=1; k<m_; ++k)  x_[k] = 1;
            x_[m_] = n_ - m_ + 1;
            ulong s = 0;
            for (ulong k=0; k<=m_; ++k)  { s+=x_[k]; s_[k]=s; }
        }


The successor is computed as follows:

        bool next()
        {
            ulong u = x_[m_];  // last element
            ulong k = m_;
            while ( --k )  { if ( x_[k]+2<=u )  break; }

            if ( k==0 )  return false;

            ulong f = x_[k] + 1;
            ulong s = s_[k-1];
            while ( k < m_ )
            {
                x_[k] = f;
                s += f;
                s_[k] = s;
                ++k;
            }
            x_[m_] = n_ - s_[m_-1];
            // s_[m_] = n_;  // unchanged

            return true;
        }

Figure 16.3-A shows the partitions of 19 into 11 parts. The data was generated with the program [FXT: comb/mpartition-demo.cc]. The implementation used is [FXT: class mpartition in comb/mpartition.h].


The auxiliary array of cumulative sums allows the recalculation of the final element without rescanning 
more than the elements just changed. About 134 million partitions per second are generated. A Gray 
code for integer partitions is described in [279], for algorithmic details see [215, sect.7.2.1.4].
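The method of this section can be condensed into a short stand-alone sketch that counts the partitions of n into m parts. Storage is a std::vector here, and the names mpartition_sketch and count_mpartitions are hypothetical.

```cpp
#include <cassert>
#include <vector>

// x[1..m] holds the parts in nondecreasing order, s[k] = x[1] + ... + x[k].
struct mpartition_sketch
{
    std::vector<unsigned long> x, s;
    unsigned long n, m;

    mpartition_sketch(unsigned long tn, unsigned long tm)
        : x(tm+1, 1), s(tm+1, 0), n(tn), m(tm)
    {
        assert( (m>0) && (m<=n) );
        x[0] = 0;
        x[m] = n - m + 1;  // initial partition: m-1 units and n-m+1
        unsigned long t = 0;
        for (unsigned long k=0; k<=m; ++k)  { t += x[k];  s[k] = t; }
    }

    bool next()
    {
        const unsigned long u = x[m];  // last element
        unsigned long k = m;
        while ( --k )  { if ( x[k]+2<=u )  break; }  // first element <= u-2 from the right
        if ( k==0 )  return false;

        const unsigned long f = x[k] + 1;
        unsigned long t = s[k-1];
        while ( k < m )  { x[k] = f;  t += f;  s[k] = t;  ++k; }
        x[m] = n - s[m-1];  // last element makes the sum equal to n
        return true;
    }
};

// Number of partitions of n into exactly m parts.
static unsigned long count_mpartitions(unsigned long n, unsigned long m)
{
    mpartition_sketch p(n, m);
    unsigned long ct = 1;
    while ( p.next() )  ++ct;
    return ct;
}
```

The call count_mpartitions(19, 11) returns 22, the number of partitions listed in figure 16.3-A.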
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16.4 The number of integer partitions 


We give expressions for generating functions for various types of partitions, as, for example, unrestricted 
partitions, partitions into an even or odd number of parts, partitions into exactly m parts, partitions into 
distinct parts, and partitions into square-free parts. 


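The following relations will be useful. The first is found by setting P_0 = 1 and P_N = \prod_{k=1}^{N} (1 + a_k) so P_N = (1 + a_N) P_{N-1} = a_N P_{N-1} + P_{N-1} = a_N P_{N-1} + a_{N-1} P_{N-2} + P_{N-2} and so on. For the second, replace a_n by a_n/(1 - a_n) (for the other direction replace a_n by a_n/(1 + a_n)):

    \prod_{n=1}^{N} (1 + a_n) = 1 + \sum_{n=1}^{N} a_n \prod_{k=1}^{n-1} (1 + a_k) = 1 + \sum_{n=1}^{N} a_n \prod_{k=n+1}^{N} (1 + a_k)    (16.4-1a)

    \prod_{n=1}^{N} \frac{1}{1 - a_n} = 1 + \sum_{n=1}^{N} \frac{a_n}{\prod_{k=1}^{n} (1 - a_k)} = 1 + \sum_{n=1}^{N} \frac{a_n}{\prod_{k=n}^{N} (1 - a_k)}    (16.4-1b)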
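The next two are given in [248, p.7, id.7 and id.6]:

    \prod_{n=0}^{\infty} (1 + x q^n) = \sum_{n=0}^{\infty} \frac{x^n q^{n(n-1)/2}}{\prod_{k=1}^{n} (1 - q^k)}    (16.4-2a)

    \frac{1}{\prod_{n=0}^{\infty} (1 - x q^n)} = \sum_{n=0}^{\infty} \frac{x^n q^{n^2-n}}{\prod_{k=1}^{n} (1 - q^k) \prod_{k=0}^{n-1} (1 - x q^k)}    (16.4-2b)
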
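The relations are the limits M → ∞ of the following:

    \prod_{n=0}^{M-1} (1 + x q^n) = \sum_{n=0}^{M} \frac{\prod_{k=0}^{n-1} (1 - q^{M-k})}{\prod_{k=1}^{n} (1 - q^k)} x^n q^{n(n-1)/2}    (16.4-3a)

    \frac{1}{\prod_{n=0}^{M-1} (1 - x q^n)} = \sum_{n=0}^{M} \frac{\prod_{k=0}^{n-1} (1 - q^{M-k})}{\prod_{k=1}^{n} (1 - q^k) \prod_{k=0}^{n-1} (1 - x q^k)} x^n q^{n(n-1)}    (16.4-3b)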


These relations are respectively the special cases (a,b) = (-1,0) and (a,b) = (0,1) of an identity due to
Jacobi [194, p.795]:

    \prod_{k=0}^{M-1} \frac{1 - a x q^k}{1 - b x q^k} = \sum_{n=0}^{M} \frac{\prod_{k=0}^{n-1} (1 - q^{M-k}) \prod_{k=0}^{n-1} (b q^k - a)}{\prod_{k=1}^{n} (1 - q^k) \prod_{k=0}^{n-1} (1 - b x q^k)} x^n q^{n(n-1)/2}    (16.4-4)


In the limit M → ∞ the first product in the numerator on the right is 1, setting a = -1 and b = 0 gives
16.4-2a, setting a = 0 and b = 1 gives 16.4-2b. The following identity (given in [p.70, rel.1.3] and
[p.19, rel.2.2.7]) is due to Cauchy, setting a = -1 and b = 0 gives 16.4-2a:

    \prod_{n=0}^{\infty} \frac{1 - a x q^n}{1 - b x q^n} = \sum_{n=0}^{\infty} \frac{\prod_{k=0}^{n-1} (b - a q^k)}{\prod_{k=1}^{n} (1 - q^k)} x^n    (16.4-5)


We will use two functions (eta functions, or \eta-functions) defined as

    \eta(x) := \prod_{n=1}^{\infty} (1 - x^n)    (16.4-6a)

    \eta_{+}(x) := \prod_{n=1}^{\infty} (1 + x^n)    (16.4-6b)
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n: Pa n: Pry n: ip n: P, n: P, 
  1:     1      11:    56      21:    792      31:   6842      41:   44583
  2:     2      12:    77      22:   1002      32:   8349      42:   53174
  3:     3      13:   101      23:   1255      33:  10143      43:   63261
  4:     5      14:   135      24:   1575      34:  12310      44:   75175
  5:     7      15:   176      25:   1958      35:  14883      45:   89134
  6:    11      16:   231      26:   2436      36:  17977      46:  105558
  7:    15      17:   297      27:   3010      37:  21637      47:  124754
  8:    22      18:   385      28:   3718      38:  26015      48:  147273
  9:    30      19:   490      29:   4565      39:  31185      49:  173525
 10:    42      20:   627      30:   5604      40:  37338      50:  204226


Figure 16.4-A: The number P_n of integer partitions of n for n ≤ 50.


  n:  P(n)   P(n,m) for m =
               1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
  1:     1     1
  2:     2     1   1
  3:     3     1   1   1
  4:     5     1   2   1   1
  5:     7     1   2   2   1   1
  6:    11     1   3   3   2   1   1
  7:    15     1   3   4   3   2   1   1
  8:    22     1   4   5   5   3   2   1   1
  9:    30     1   4   7   6   5   3   2   1   1
 10:    42     1   5   8   9   7   5   3   2   1   1
 11:    56     1   5  10  11  10   7   5   3   2   1   1
 12:    77     1   6  12  15  13  11   7   5   3   2   1   1
 13:   101     1   6  14  18  18  14  11   7   5   3   2   1   1
 14:   135     1   7  16  23  23  20  15  11   7   5   3   2   1   1
 15:   176     1   7  19  27  30  26  21  15  11   7   5   3   2   1   1
 16:   231     1   8  21  34  37  35  28  22  15  11   7   5   3   2   1   1


Figure 16.4-B: Numbers P(n,m) of partitions of n into m parts.


16.4.1 Unrestricted partitions and partitions into m parts 


The number of integer partitions of n is sequence A000041 in [312], the values for 1 ≤ n ≤ 50 are shown
in figure 16.4-A. If we denote the number of partitions of n into exactly m parts by P(n, m), then

    P(n,m) = P(n-1,m-1) + P(n-m,m)    (16.4-7)

where we set P(0,0) = 1. We obviously have P_n = \sum_{m=1}^{n} P(n,m). Figure 16.4-B shows P(n, m) for
n ≤ 16. It was created with the program [FXT: comb/num-partitions-demo.cc].

The number of partitions into m parts equals the number of partitions with maximal part equal to m.
This can easily be seen by drawing a Ferrers diagram (or Young diagram) and its transpose as follows,
for the partition 5+2+2+1 of 10:


      43111           5221

   5  xxxxx        4  xxxx
   2  xx           3  xxx
   2  xx           1  x
   1  x            1  x
                   1  x


Any partition with maximal part m (here 5) corresponds to a partition into exactly m parts. The 
generating function for the partitions into exactly m parts is 


    \sum_{n=1}^{\infty} P(n,m) x^n = \frac{x^m}{\prod_{k=1}^{m} (1 - x^k)}    (16.4-8)


For example, the column for m = 3 in figure 16.4-B corresponds to the power series
? m=3; (x^m/prod(k=1,m,1-x^k)+O(x^17))
 x^3 + x^4 + 2*x^5 + 3*x^6 + 4*x^7 + 5*x^8 + 7*x^9 + 8*x^10 + \
 10*x^11 + 12*x^12 + 14*x^13 + 16*x^14 + 19*x^15 + 21*x^16 + O(x^17)
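
Relation 16.4-7 is enough to rebuild figure 16.4-B. The following is a small sketch (the function names are chosen here for illustration and are not taken from the FXT demo) that fills a table with P(n,m) and sums a row to obtain P_n.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector< std::vector<ulong> > table_t;

// Table T[n][m] = P(n,m) for 0 <= m <= n <= N, computed with
//   P(n,m) = P(n-1,m-1) + P(n-m,m)  and  P(0,0) = 1.
table_t num_partitions_table(ulong N)
{
    table_t T(N+1, std::vector<ulong>(N+1, 0));
    T[0][0] = 1;
    for (ulong n=1; n<=N; ++n)
        for (ulong m=1; m<=n; ++m)
        {
            // T[n-m][m] is still zero whenever n-m < m, as required
            T[n][m] = T[n-1][m-1] + T[n-m][m];
        }
    return T;
}

// P_n as the sum over row n of the table (requires n < T.size()).
ulong num_partitions(const table_t &T, ulong n)
{
    assert( n < T.size() );
    ulong s = 0;
    for (ulong m=0; m<=n; ++m)  s += T[n][m];
    return s;
}
```

With T = num_partitions_table(16) the entry T[16][3] is 21 and num_partitions(T, 16) is 231, in agreement with the last row of the figure.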
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We have 


    \frac{1}{\prod_{n=1}^{\infty} (1 - u x^n)} = 1 + \sum_{n=1}^{\infty} \sum_{m=1}^{\infty} P(n,m) x^n u^m    (16.4-9)


The rows of figure 16.4-B correspond to a fixed power of x:
? 1/prod(n=1,N,1-u*x^n)
 1 + u*x + (u^2 + u)*x^2 + (u^3 + u^2 + u)*x^3 + (u^4 + u^3 + 2*u^2 + u)*x^4
 + (u^5 + u^4 + 2*u^3 + 2*u^2 + u)*x^5 + (u^6 + u^5 + 2*u^4 + 3*u^3 + 3*u^2 + u)*x^6 +...


The generating function for the number P_n of integer partitions of n is found by setting u = 1:

    \sum_{n=0}^{\infty} P_n x^n = \frac{1}{\prod_{n=1}^{\infty} (1 - x^n)} = \frac{1}{\eta(x)}    (16.4-10)


The partitions are found in the expansion of

    \frac{1}{\prod_{k=1}^{\infty} (1 - t_k x^k)}    (16.4-11)

? N=5; z='x+O('x^(N+1)); 1/prod(k=1,N,1-eval(Str("t"k))*z^k)
 1 + t1*x + (t1^2 + t2)*x^2 + (t1^3 + t2*t1 + t3)*x^3
 + (t1^4 + t2*t1^2 + t3*t1 + t2^2 + t4)*x^4
 + (t1^5 + t2*t1^3 + t3*t1^2 + (t2^2 + t4)*t1 + t3*t2 + t5)*x^5


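Summing over m in relation 16.4-8 we find that

    \frac{1}{\eta(x)} = \sum_{n=0}^{\infty} \frac{x^n}{\prod_{k=1}^{n} (1 - x^k)}    (16.4-12)


This relation also is the special case a_n = x^n (and N → ∞) of 16.4-1b. We also have (setting
x = q in 16.4-2b)

    \frac{1}{\eta(x)} = \sum_{n=0}^{\infty} \frac{x^{n^2}}{\left[ \prod_{k=1}^{n} (1 - x^k) \right]^2}    (16.4-13)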


The expression can also be found by observing that a partition can be decomposed into a square and two
partitions whose maximal part does not exceed the length of the side of the square [176, sect.19.7]:


43111 
HEXXX 
## 


XX 
x 


ENNO 


Let P(n, m,r) be the number of partitions of n into at most m parts with largest part r, then [17] ex.15, 
p.575] 


ES 2% n n n? 
az" y" q 
y P(nm,)g'z"y' = — (16.4-14) 
n,m,r=0 > T fio (L— 845) k-o (1 — 99) 


Euler's pentagonal number theorem is (see [41] and [16]):

    \eta(x) = \sum_{n=-\infty}^{+\infty} (-1)^n x^{n(3n-1)/2} = 1 + \sum_{n=1}^{\infty} (-1)^n \left( x^{n(3n-1)/2} + x^{n(3n+1)/2} \right)    (16.4-15a)

            = 1 - x - x^2 + x^5 + x^7 - x^{12} - x^{15} + x^{22} + x^{26} - x^{35} - x^{40} \pm \ldots    (16.4-15b)


The sequence of exponents is entry A001318 in [312], the generalized pentagonal numbers.
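
Multiplying relation 16.4-10 by \eta(x) and comparing coefficients in 16.4-15a gives the recurrence P_n = \sum_{k \ge 1} (-1)^{k+1} [ P_{n-k(3k-1)/2} + P_{n-k(3k+1)/2} ], where terms with negative index are dropped. The sketch below (function name chosen here for illustration) uses it to compute the values of figure 16.4-A. It assumes the results fit into a long long.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Partition numbers P_0..P_N via the pentagonal number recurrence.
std::vector<long long> partition_numbers(ulong N)
{
    std::vector<long long> P(N+1, 0);
    P[0] = 1;
    for (ulong n=1; n<=N; ++n)
    {
        long long sum = 0;
        for (ulong k=1; ; ++k)
        {
            ulong g1 = k*(3*k-1)/2;         // generalized pentagonal numbers
            if ( g1 > n )  break;
            long long sgn = (k & 1) ? +1 : -1;  // sign (-1)^(k+1)
            sum += sgn * P[n-g1];
            ulong g2 = k*(3*k+1)/2;
            if ( g2 <= n )  sum += sgn * P[n-g2];
        }
        P[n] = sum;
    }
    return P;
}
```

Only about sqrt(n) terms are needed for each P_n, which makes this cheaper than summing a row of figure 16.4-B.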
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Further expressions for 7 are (set q := x and x := —z in relation |16.4-2a|for the first equality) 


eu (-1)* qn1)/2 oo gin tn (1 - agen) 
= TEHR 74 x= 16.4-16a 
oo 2 oo oo 
= y) gr i II @- 2) = Yi” [[ (1-2*) (16.4-16b) 
k=n+1 n=0 k=n+1 


Write n(x) = IL-o J(a?)*!) where J is defined by relation |38.1-2a| on page Then a divisionless 
expression for 1/7 is obtained via relation 38.1-11d|on page |728| 


- [I ll (1+ 56507 = II sic (=) (16.4-17) 
k=0 


k=0 j=0 


The sequences of the numbers of partitions into an even/odd number of parts start respectively as

 1, 0, 1, 1, 3, 3, 6, 7, 12, 14, 22, 27, 40, 49, 69, 86, 118, 146, 195, 242,
 0, 1, 1, 2, 2, 4, 5, 8, 10, 16, 20, 29, 37, 52, 66, 90, 113, 151, 190, 248,

These are the entries A027187/A027193 in [312]. Their generating functions are found by respectively
setting a_n = x^{2n} and a_n = x^{2n+1} in 16.4-1b (see relation 31.3-1c on page 604 for the definition of \Theta_4):


Y mo Ll houses cay = HESE 164189) 
n=0 ut (1 — ak) 2 n(x) 7,4 (x) n(x) n=0 EC | 
oo gent 1 | 1 1 | ex usd 2 1-04(z) 
E 1 ov = (16.4-18b) 
2 maT oak) 21m) m) n) P» O T 
Adding the leftmost sums gives yet another expression for 1/7): 
Lo > 22 (1 — gon) y ant (164-19) 
"zx g ka d | 


This relation can be generalized by adding the generating functions for partitions into parts r + j for 
3=0,1,...,r— 1. For example, for r = 3 we have: 


1 oo q3n (1 _ eo) (1 _ ae] + g3ntl (1 E gu) + q3n*2 
= » Win gh) (16.4-20) 
n=0 k=1 


The Rogers-Ramanujan identities for the numbers of partitions into parts congruent to 1 or 4 (and 2 or
3, respectively) modulo 5 are [176, sec.19.13, p.290]:

    \prod_{n=0}^{\infty} \frac{1}{(1 - x^{5n+1}) (1 - x^{5n+4})} = \sum_{n=0}^{\infty} \frac{x^{n^2}}{\prod_{k=1}^{n} (1 - x^k)}    (16.4-21a)

    \prod_{n=0}^{\infty} \frac{1}{(1 - x^{5n+2}) (1 - x^{5n+3})} = \sum_{n=0}^{\infty} \frac{x^{n^2+n}}{\prod_{k=1}^{n} (1 - x^k)}    (16.4-21b)


Many identities of this kind are listed in [311], a generalization is given in [87]. The sequences
of coefficients are entries A003114 and A003106 in [312]:

 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 9, 10, 12, 14, 17, 19, 23, 26, 31, 35, 41,
 1, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 6, 6, 8, 9, 11, 12, 15, 16, 20, 22, 26,
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n: Dn n: D, n: D; qn. Dn n: Dn 
  1:    1      11:   12      21:    76      31:   340      41:  1260
  2:    1      12:   15      22:    89      32:   390      42:  1426
  3:    2      13:   18      23:   104      33:   448      43:  1610
  4:    2      14:   22      24:   122      34:   512      44:  1816
  5:    3      15:   27      25:   142      35:   585      45:  2048
  6:    4      16:   32      26:   165      36:   668      46:  2304
  7:    5      17:   38      27:   192      37:   760      47:  2590
  8:    6      18:   46      28:   222      38:   864      48:  2910
  9:    8      19:   54      29:   256      39:   982      49:  3264
 10:   10      20:   64      30:   296      40:  1113      50:  3658


Figure 16.4-C: The number D_n of integer partitions into distinct parts of n for n ≤ 50.


16.4.2 Partitions into distinct parts 


The generating function for the number D_n of partitions of n into distinct parts is

    \sum_{n=0}^{\infty} D_n x^n = \prod_{n=1}^{\infty} (1 + x^n) = \eta_{+}(x)    (16.4-22)


The number of partitions into distinct parts equals the number of partitions into odd parts:

    \eta_{+}(x) = \frac{\eta(x^2)}{\eta(x)} = \frac{1}{\prod_{k=1}^{\infty} (1 - x^{2k-1})}    (16.4-23)


The sequence of coefficients D_n is entry A000009 in [312], see figure 16.4-C. The generating function for
D(n, m), the number of partitions of n into exactly m distinct parts, is (see p.559)

    \sum_{n=0}^{\infty} D(n,m) x^n = \frac{x^{m(m+1)/2}}{\prod_{k=1}^{m} (1 - x^k)}    (16.4-24)


Summing over m (or setting q = x in 16.4-2a) gives

    \eta_{+}(x) = \sum_{n=0}^{\infty} \frac{x^{n(n+1)/2}}{\prod_{k=1}^{n} (1 - x^k)}    (16.4-25)
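
The equality of the two counts in relation 16.4-23 can be checked numerically by expanding both products as truncated power series. The following sketch does this with plain coefficient arrays; the function names are assumptions made here for illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Coefficients of prod_{p=1..N} (1 + x^p) up to x^N:
// D[n] = number of partitions of n into distinct parts.
std::vector<ulong> distinct_parts(ulong N)
{
    std::vector<ulong> D(N+1, 0);
    D[0] = 1;
    for (ulong p=1; p<=N; ++p)
        for (ulong n=N; n>=p; --n)  D[n] += D[n-p];  // downward: each part used at most once
    return D;
}

// Coefficients of prod_{p odd, p<=N} 1/(1 - x^p) up to x^N:
// O[n] = number of partitions of n into odd parts.
std::vector<ulong> odd_parts(ulong N)
{
    std::vector<ulong> O(N+1, 0);
    O[0] = 1;
    for (ulong p=1; p<=N; p+=2)
        for (ulong n=p; n<=N; ++n)  O[n] += O[n-p];  // upward: parts may repeat
    return O;
}
```

For N = 50 the two vectors agree entry by entry, and the last entry is 3658 as in figure 16.4-C.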


Equivalently, the Ferrers diagram of a partition into m distinct parts can be decomposed into a triangle 
of size m (m + 1)/2 and a partition into at most m elements: 


THHEHHEXXXXX HHHHH XXXXX 
TIHHEHEXXXX == HHHH + XXXX 
THHEX XXX HHH XXXX 
HHX ## x 

Hx # x 


The connection between relations|16.4-24|and}16.4-13}can be seen by drawing a diagonal in the diagram 
of an unrestricted partition: 


HXXXXXXX # XXXXXXX # XXXXXXX XXXX 
XHXXXXXX == x# XXXXXX == # + XXXXXX + XX 
XX#XXXXX xxt + XXXXX # XXXXX 

XX XX 

x x 


So each unrestricted partition is decomposed into a diagonal (of, say, m elements) and two partitions into 
either m or m — 1 distinct parts. The term corresponding to a diagonal of length m is 
2 "uS 
z" [D(n,m) + D(n,m- 1) = x "eU (16.4-26) 
M= (1 7 25)] 
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See [265] for a survey about proving identities using Ferrers diagrams. We also have 


    \prod_{n=1}^{\infty} (1 + u x^n) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} D(n,m) x^n u^m    (16.4-27)

? prod(n=1,N,1+u*x^n)
 1 + u*x + u*x^2 + (u^2 + u)*x^3 + (u^2 + u)*x^4 + (2*u^2 + u)*x^5
 + (u^3 + 2*u^2 + u)*x^6 + (u^3 + 3*u^2 + u)*x^7 + (2*u^3 + 3*u^2 + u)*x^8
 + (3*u^3 + 4*u^2 + u)*x^9 + (u^4 + 4*u^3 + 4*u^2 + u)*x^10 +...


The partitions into distinct parts can be computed as the expansion of

    \prod_{k=1}^{\infty} (1 + t_k x^k)    (16.4-28)

? N=9; z='x+O('x^(N+1));
? prod(k=1,N,1+eval(Str("t"k))*z^k)
 1 + t1*x + t2*x^2 + (t2*t1 + t3)*x^3 + (t3*t1 + t4)*x^4
 + (t4*t1 + t3*t2 + t5)*x^5 + ((t3*t2 + t5)*t1 + t4*t2 + t6)*x^6
 + ((t4*t2 + t6)*t1 + t5*t2 + t4*t3 + t7)*x^7
 + ((t5*t2 + t4*t3 + t7)*t1 + t6*t2 + t5*t3 + t8)*x^8
 + ((t6*t2 + t5*t3 + t8)*t1 + (t4*t3 + t7)*t2 + t6*t3 + t5*t4 + t9)*x^9


Let E(n,m) be the number of partitions of n into distinct parts with maximal part m, then

    \sum_{n=0}^{\infty} E(n,m) x^n = x^m \prod_{k=1}^{m-1} (1 + x^k)    (16.4-29)


Summing over m (or setting a_n = x^n and N → ∞ in relation 16.4-1a on page 344) gives:

    \eta_{+}(x) = 1 + \sum_{n=1}^{\infty} x^n \prod_{k=1}^{n-1} (1 + x^k)    (16.4-30)


For the first of the following equalities, set q := x^2 in 16.4-2b, the second is given in [310, p.100]:

    \eta_{+}(x) = \sum_{n=0}^{\infty} \frac{x^{2n^2-n}}{\prod_{k=1}^{2n} (1 - x^k)} = \sum_{n=0}^{\infty} \frac{x^{2n^2+n}}{\prod_{k=1}^{2n+1} (1 - x^k)}    (16.4-31)

Set x := -q in relation 16.4-2b to obtain an expression for 1/\eta_{+}:

    \frac{1}{\eta_{+}(q)} = \sum_{n=0}^{\infty} \frac{(-1)^n q^{n^2}}{\prod_{k=1}^{n} (1 - q^{2k})}    (16.4-32)



The sequences of numbers of partitions into distinct even/odd parts start respectively as (see entries
A035457 and A000700 in [312])

 1, 0, 1, 0, 1, 0, 2, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 8, 0, 10, 0, 12, 0, 15,
 1, 1, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 8, 8, 9, 11,
The generating function for the partitions into distinct even parts is 
oo 4 
n (a*) 1 
(1-27) = 9, (24%) = n,(-2)n (+2) = E (16.4-33) 
H + (7) + + n (a2) II. (1 — 24" ¥?) 


The last equality tells us that the function also enumerates the partitions into even parts that are not a 
multiple of 4. Setting q := x? and x := 1 in|16.4-2a| gives 


[ete = Y Tan) (16.4-34) 
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The generating function for partitions into distinct odd parts is 


Tarn - tte) 00 1 a) _ (Y 
Tate") = 325 7 ne) 7 owe» 7 01 dd 


Also (for the first equality set q := 2? in relation |16.4-2a]: 


oo 


I[ 04279 -> LT a TM Gn) -5 AE (16.4-36) 


n=0 


The number of partitions where each part is repeated at most r - 1 times has the generating function

    \prod_{n=1}^{\infty} (1 + x^n + x^{2n} + x^{3n} + \ldots + x^{(r-1)n}) = \frac{\eta(x^r)}{\eta(x)} = \frac{1}{\prod_{k \not\equiv 0 \bmod r} (1 - x^k)}    (16.4-37)

The second equality tells us that the number of such partitions equals the number of partitions into parts
not divisible by r, equivalently, partitions into m parts where m is not divisible by r.


Replacing x by x^r and q by x^m in relation 16.4-2b gives an identity for the partitions into parts ≡ r mod m
(valid for 0 < r < m, for r = 0 replace x by x^m in 16.4-13):

    \frac{1}{\prod_{n=0}^{\infty} (1 - x^{mn+r})} = \sum_{n=0}^{\infty} \frac{x^{mn^2+(r-m)n}}{\prod_{k=0}^{n-1} (1 - x^{mk+m}) \prod_{k=0}^{n-1} (1 - x^{mk+r})}    (16.4-38)


The same replacements (where 0 ≤ r < m) in relation 16.4-2a give an identity for the partitions into
distinct parts ≡ r mod m:

    \prod_{n=0}^{\infty} (1 + x^{mn+r}) = \sum_{n=0}^{\infty} \frac{x^{[mn^2+(2r-m)n]/2}}{\prod_{k=1}^{n} (1 - x^{mk})}    (16.4-39)


A generating function for the partitions into distinct parts that differ by at least d is

    \sum_{n=0}^{\infty} \frac{x^{T(d,n)}}{\prod_{k=1}^{n} (1 - x^k)}    where    T(d,n) := d \frac{n(n+1)}{2} - (d-1) n    (16.4-40)


See sequences A003114 (d = 2), A025157 (d = 3), A025158 (d = 4), A025159 (d = 5), A025160 (d = 6),
and A025161 (d = 7). The relation follows from cutting out an (incomplete) stretched triangle in the
Ferrers diagram (here for d = 2):


dist. »- d-2 x^ (d* (nx (n+1))/2 - (d- D * 1/prod(...) 
XXXXXXXXXXXXX FAR X Xx WHHHHHHHHH XXXX 
XXXXXXXXX == TH == WHHHHHHH - W + XX 
XXXXXX THEHHHEX WHHEHH W x 

XXX HHH WHHH W 

x # Wi W 


The sequences of numbers of partitions into an even/odd number of distinct parts are entries A067661
and A067659 in [312], respectively:

 1, 0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 9, 11, 13, 16, 19, 23, 27, 32, 38, 45,
 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 5, 6, 8, 9, 11, 14, 16, 19, 23, 27, 32, 38, 44,


The corresponding generating functions are 


oo g2 n 
C REC MN S oe (16.4-41a) 
2 n=0 11k=1 e =g ) 
m 2 
n(x) u n(x) qa +3n+1 q2n+l gan tn 
a M — — — 16.4-41b 
z eras hrena m Cen" 


Adding relations 16.4-41a and 16.4-41b gives the second equality in 16.4-31, subtraction gives the second
equality in 16.4-16a.
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16.4.3 Partitions into square-free parts 1 


We give relations for the ordinary generating functions for partitions into square-free parts. The Möbius
function \mu is defined in section 37.1.2. The sequence of power series coefficients is given at
the end of each relation.


Partitions into square-free parts (entry A073576 in [312]):

    \prod_{n=1}^{\infty} \frac{1}{1 - \mu(n)^2 x^n} = \prod_{n=1}^{\infty} \eta(x^{n^2})^{-\mu(n)}    (16.4-42)


 1, 1, 2, 3, 4, 6, 9, 12, 16, 21, 28, 36, 47, 60, 76, 96, 120, 150,


Partitions into parts that are not square-free, note the start index on the right side product, (entry 


A114374): 


I 1-ü- - JI ala" T (16.4-43) 


n=2 


1, 0, O, O, 1, O, Oy. O, 2, 1, O, O, 3, 1, Bn 
11, 6, 4, 3, 15, 8, 6, 3, 22, 13, 11, 6, ad: 28. 9152 ER 46? de. 24, 


Partitions into distinct square-free parts (entry A087188): 


oo 


TI G+ um) - II AG 2^ m (16.4-44) 


n=1 
1, 1, 1, 2, 1, 2, 3, 3, 4, 4, 5, 6, 6, 8, 9, 10, 13, 14, 16, 18, 20, 


Partitions into odd square-free parts, also partitions into parts m such that 2m is square-free (entry 


A134345): 


r1 1 r1 1 
= AA = 16.4-45 
1 1 — u(2n — 1)? z?n-1 I" 1— p(2n)? z^ ( z 
—p(2n—1) 
oc ay oc 
7) (« n 12 tH@n=1) 
(ea) = II» (x? 1) ) (16.4-45b) 
n=1 n=1 


1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 9, 11, 13, 16, 19, 23, 27, 32, 38, 44, 


Partitions into distinct odd square-free parts, also partitions into distinct parts m such that 2m is square- 


free (entry A 134337): 


oo oo oo n (x@n-»") eee 
+ 
H n=1 n=1 


1, 1, 0, 1, 1, 1, 1, 1, 2; 1, 1, 2, 2, 2, 2, 3, 4, 3, 4, 5, Sy 6, 6, Ta 


Partitions into square-free parts m Æ 0 mod p where p is prime: 


co p-1 y (arn) —H(pn—r) 


r1 1 
Mie... a 111 n (ar ten?) (16.4-47) 
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For example, partitions into square-free parts m Æ 0 mod 3: 
a 1 n 1 
I] = 7 II > (16.4-48a) 


2 Js 2 
ua n) n=1, n0 mod 3 1 p(n) il 


n=1 


oo 7 (26 ud puse n (ze dd zi: 


= — ML (16.4-48b) 
1 n (x3 (3 ay | n (x3 (3 cue 
1,1,2, 2, 3, 4, 5, 7, 8, 10, 13, 16, 20, 24, 30, 36, 43, 52, 61, 73, 86, 
Partitions into distinct square-free parts m Æ 0 mod p where p is prime: 
tB(pn-r) 
oo , oo p-1 ny (zem) 
n=1 mnelmrel n4 (x? E ) 
For example, partitions into distinct square-free parts m 4 0 mod 3: 
I[ @+e@ny2") = II (1+ p(n)? 2") = (16.4-50a) 
n=1 n=1, n40 mod 3 
aV q tH(3n—1) ay q +H(3n—2) 
II n+ ES ame ) une ES nc) ) TES 
n dna (x3 (3 dd Ny (x3 (3 enr ` 


1. 1; 1, 1. 0, 1; 1, 2; 2; 1, 2, 25; 3, 4, 4, 4, 4, 5, 6, Ty T, T, 9, 9; 12, 12, 


16.4.4 Relations involving sums of divisors 1 


The logarithmic generating function (LGF) for objects counted by the sequence c_n has the following form:

    \sum_{n=1}^{\infty} \frac{c_n x^n}{n}    (16.4-51)


The LGF for \sigma(n), the sum of divisors of n, is connected to the ordinary generating function for the
partitions as follows (compare with relation 37.2-15a on page 712):

    \sum_{n=1}^{\infty} \frac{\sigma(n) x^n}{n} = \log(1/\eta(x))    (16.4-52)


We generate the sequence of the \sigma(n), entry A000203 in [312], using GP:

? N=25; L=ceil(sqrt(N))+1; x='x+O('x^N);
? s=log(1/eta(x))
 x + 3/2*x^2 + 4/3*x^3 + 7/4*x^4 + 6/5*x^5 + ...
? v=Vec(s); vector(#v,j,v[j]*j)
 [1, 3, 4, 7, 6, 12, 8, 15, 13, 18, 12, 28, 14, 24, 24, 31, 18, 39, 20, 42, 32, 36, 24, 60]
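
Applying x d/dx to both sides of relation 16.4-52 and multiplying by 1/\eta(x) = \sum P_n x^n gives n P_n = \sum_{k=1}^{n} \sigma(k) P_{n-k}. The sketch below uses this consequence to recompute the partition numbers from the sums of divisors. The function name is an assumption made here for illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Partition numbers P_0..P_N from  n*P_n = sum_{k=1..n} sigma(k)*P_{n-k}.
std::vector<ulong> partitions_via_sigma(ulong N)
{
    // sigma[n] = sum of the divisors of n, via a sieve over all divisors d
    std::vector<ulong> sigma(N+1, 0);
    for (ulong d=1; d<=N; ++d)
        for (ulong n=d; n<=N; n+=d)  sigma[n] += d;

    std::vector<ulong> P(N+1, 0);
    P[0] = 1;
    for (ulong n=1; n<=N; ++n)
    {
        ulong t = 0;
        for (ulong k=1; k<=n; ++k)  t += sigma[k] * P[n-k];
        assert( t % n == 0 );  // the sum is always a multiple of n
        P[n] = t / n;
    }
    return P;
}
```

For example, n = 3 gives 1*2 + 3*1 + 4*1 = 9 = 3*P_3, so P_3 = 3.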


Write o(n) for the sum of odd divisors of n (entry A000593). The LGF is related to the partitions into
distinct parts:

    \sum_{n=1}^{\infty} \frac{o(n) x^n}{n} = \log(\eta_{+}(x))    (16.4-53)


? s=log(eta(x^2)/eta(x))
 x + 1/2*x^2 + 4/3*x^3 + 1/4*x^4 + 6/5*x^5 + ...
? v=Vec(s); vector(#v,j,v[j]*j)
 [1, 1, 4, 1, 6, 4, 8, 1, 13, 6, 12, 4, 14, 8, 24, 1, 18, 13, 20, 6, 32, 12, 24, 4]
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Let s(n) be the sum of square-free divisors of n. The LGF for the sums s(n) is the logarithm of the 
generating function for the partitions into square-free parts: 


    \sum_{n=1}^{\infty} \frac{s(n) x^n}{n} = \log\left( \prod_{n=1}^{\infty} \eta(x^{n^2})^{-\mu(n)} \right)    (16.4-54)

The sequence of the s(n) is entry A048250 in [312]:

? s=log(prod(n=1,L,eta(x^(n^2))^(-moebius(n))))
 x + 3/2*x^2 + 4/3*x^3 + 3/4*x^4 + 6/5*x^5 + ...
? v=Vec(s); vector(#v,j,v[j]*j)
 [1, 3, 4, 3, 6, 12, 8, 3, 4, 18, 12, 12, 14, 24, 24, 3, 18, 12, 20, 18, 32, 36, 24, 12]


A divisor d of n is called a unitary divisor if gcd(d, n/d) = 1. We have the following identity, note the
exponent -\mu(n)/n on the right side:

    \sum_{n=1}^{\infty} \frac{u(n) x^n}{n} = \log\left( \prod_{n=1}^{\infty} \eta(x^{n^2})^{-\mu(n)/n} \right)    (16.4-55)


The sequence of the u(n) is entry A034448:

? s=(log(prod(n=1,L,eta(x^(n^2))^(-moebius(n)/n))))
 x + 3/2*x^2 + 4/3*x^3 + 5/4*x^4 + 6/5*x^5 + ...
? v=Vec(s); vector(#v,j,v[j]*j)
 [1, 3, 4, 5, 6, 12, 8, 9, 10, 18, 12, 20, 14, 24, 24, 17, 18, 30, 20, 30, 32, 36, 24, 36]


The sums \bar{u}(n) of the divisors of n that are not unitary have a LGF connected to the partitions into
distinct square-free parts:

    \sum_{n=1}^{\infty} \frac{\bar{u}(n) x^n}{n} = \log\left( \prod_{n=2}^{\infty} \eta(x^{n^2})^{+\mu(n)/n} \right)    (16.4-56)


The sequence of the sums \bar{u}(n) is entry A048146:

? s=log(prod(n=2,L,eta(x^(n^2))^(+moebius(n)/n)))
 1/2*x^4 + 3/4*x^8 + 1/3*x^9 + 2/3*x^12 + 7/8*x^16 + ...
? v=Vec(s+'x); v[1]=0; \\ let vector start with 3 zeros
? vector(#v,j,v[j]*j)
 [0, 0, 0, 2, 0, 0, 0, 6, 3, 0, 0, 8, 0, 0, 0, 14, 0, 9, 0, 12, 0, 0, 0, 24, 5, 0, 12]


For the sums \bar{s}(n) of the divisors of n that are not square-free we have the LGF

    \sum_{n=1}^{\infty} \frac{\bar{s}(n) x^n}{n} = \log\left( \prod_{n=2}^{\infty} \eta(x^{n^2})^{+\mu(n)} \right)    (16.4-57)

The sequence of the sums \bar{s}(n) is entry A162296:

? s=log(prod(n=2,L,eta(x^(n^2))^(+moebius(n))))
 x^4 + 3/2*x^8 + x^9 + 4/3*x^12 + 7/4*x^16 + ...
? v=Vec(s+'x); v[1]=0; \\ let vector start with 3 zeros
? vector(#v,j,v[j]*j)
 [0, 0, 0, 4, 0, 0, 0, 12, 9, 0, 0, 16, 0, 0, 0, 28, 0, 27, 0, 24, 0, 0, 0, 48, 25, 0, 36]
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Chapter 17 


Set partitions 


For a set of n elements, say S_n := {1, 2, ..., n}, a set partition is a set P = {s_1, s_2, ..., s_k} of nonempty
subsets s_i of S_n that are pairwise disjoint and whose union equals S_n.


For example, there are 5 set partitions of the set S_3 = {1, 2, 3}:

 1:  { {1, 2, 3} }
 2:  { {1, 2}, {3} }
 3:  { {1, 3}, {2} }
 4:  { {1}, {2, 3} }
 5:  { {1}, {2}, {3} }


The following sets are not set partitions of S_3:

 { {1, 2, 3}, {1} }   // intersection not empty
 { {1}, {3} }         // union does not contain 2


As the order of elements in a set does not matter we sort them in ascending order. For a set of sets we
order the sets in ascending order of the first elements. The number of set partitions of the n-set is the
Bell number B_n, see section 17.2 on page 358.


17.1 Recursive generation 


We write Z_n for the list of all set partitions of the n-element set S_n. To generate Z_n we observe that with
a complete list Z_{n-1} of partitions of the set S_{n-1} we can generate the elements of Z_n in the following
way: For each element (set partition) P ∈ Z_{n-1}, create set partitions of S_n by appending the element n
to the first, second, ..., last subset, and one more by appending the set {n} as the last subset.


For example, the partition {{1, 2}, {3, 4}} ∈ Z_4 leads to 3 partitions of S_5:

 P = { {1, 2}, {3, 4} }
 --> { {1, 2, 5}, {3, 4} }
 --> { {1, 2}, {3, 4, 5} }
 --> { {1, 2}, {3, 4}, {5} }


Now we start with the only partition {{1}} of the 1-element set and apply the described step n - 1 times.
The construction (given in [261, p.89]) is shown in the left column of figure 17.1-A, the right column
shows all set partitions for n = 4.
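
The construction can be written down directly if each list Z_k is kept in memory. The following is a minimal sketch of that idea, not the FXT implementation: the type names and the function set_partitions are assumptions made here, and the memory use grows with the Bell numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<int> Block;            // one subset, elements ascending
typedef std::vector<Block> SetPartition;   // subsets ordered by first element

// List Z_n of all set partitions of {1,...,n}, built from Z_1 = [ {{1}} ]
// by appending the new element to each subset in turn and, last, as a
// new singleton subset.
std::vector<SetPartition> set_partitions(int n)
{
    assert( n >= 1 );
    std::vector<SetPartition> Z(1, SetPartition(1, Block(1, 1)));
    for (int e=2; e<=n; ++e)
    {
        std::vector<SetPartition> Znew;
        for (std::size_t i=0; i<Z.size(); ++i)
        {
            const SetPartition &P = Z[i];
            for (std::size_t j=0; j<P.size(); ++j)
            {
                SetPartition Q(P);
                Q[j].push_back(e);        // append e to the j-th subset
                Znew.push_back(Q);
            }
            SetPartition Q(P);
            Q.push_back(Block(1, e));     // append {e} as the last subset
            Znew.push_back(Q);
        }
        Z.swap(Znew);
    }
    return Z;
}
```

For n = 4 the list has 15 entries, and the third entry is { {1, 2, 4}, {3} }, as in the right column of figure 17.1-A.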


A modified version of the recursive construction generates the set partitions in a minimal-change order.
We can generate the 'incremented' partitions in two ways, forward (left to right)

 P = { {1, 2}, {3, 4} }
 --> { {1, 2, 5}, {3, 4} }
 --> { {1, 2}, {3, 4, 5} }
 --> { {1, 2}, {3, 4}, {5} }


or backward (right to left)

 P = { {1, 2}, {3, 4} }
 --> { {1, 2}, {3, 4}, {5} }
 --> { {1, 2}, {3, 4, 5} }
 --> { {1, 2, 5}, {3, 4} }



 ------------------------          setpart(4)
 p1={1}                             1:  {1, 2, 3, 4}
  --> p={1, 2}                      2:  {1, 2, 3}, {4}
  --> p={1}, {2}                    3:  {1, 2, 4}, {3}
 ------------------------           4:  {1, 2}, {3, 4}
 p1={1, 2}                          5:  {1, 2}, {3}, {4}
  --> p={1, 2, 3}                   6:  {1, 3, 4}, {2}
  --> p={1, 2}, {3}                 7:  {1, 3}, {2, 4}
 p1={1}, {2}                        8:  {1, 3}, {2}, {4}
  --> p={1, 3}, {2}                 9:  {1, 4}, {2, 3}
  --> p={1}, {2, 3}                10:  {1}, {2, 3, 4}
  --> p={1}, {2}, {3}              11:  {1}, {2, 3}, {4}
 ------------------------          12:  {1, 4}, {2}, {3}
 p1={1, 2, 3}                      13:  {1}, {2, 4}, {3}
  --> p={1, 2, 3, 4}               14:  {1}, {2}, {3, 4}
  --> p={1, 2, 3}, {4}             15:  {1}, {2}, {3}, {4}
 p1={1, 2}, {3}
  --> p={1, 2, 4}, {3}
  --> p={1, 2}, {3, 4}
  --> p={1, 2}, {3}, {4}
 p1={1, 3}, {2}
  --> p={1, 3, 4}, {2}
  --> p={1, 3}, {2, 4}
  --> p={1, 3}, {2}, {4}
 p1={1}, {2, 3}
  --> p={1, 4}, {2, 3}
  --> p={1}, {2, 3, 4}
  --> p={1}, {2, 3}, {4}
 p1={1}, {2}, {3}
  --> p={1, 4}, {2}, {3}
  --> p={1}, {2, 4}, {3}
  --> p={1}, {2}, {3, 4}
  --> p={1}, {2}, {3}, {4}


Figure 17.1-A: Recursive construction of the set partitions of the 4-element set S_4 = {1, 2, 3, 4} (left)
and the resulting list of all set partitions of 4 elements (right).


 ---------------------    ---------------------------    setpart(4)
 P={1}                    P={1, 2, 3}                    {1, 2, 3, 4}
  --> {1, 2}               --> {1, 2, 3, 4}              {1, 2, 3}, {4}
  --> {1}, {2}             --> {1, 2, 3}, {4}            {1, 2}, {3}, {4}
                                                         {1, 2}, {3, 4}
                          P={1, 2}, {3}                  {1, 2, 4}, {3}
                           --> {1, 2}, {3}, {4}          {1, 4}, {2}, {3}
                           --> {1, 2}, {3, 4}            {1}, {2, 4}, {3}
                           --> {1, 2, 4}, {3}            {1}, {2}, {3, 4}
                                                         {1}, {2}, {3}, {4}
 ---------------------    P={1}, {2}, {3}                {1}, {2, 3}, {4}
 P={1, 2}                  --> {1, 4}, {2}, {3}          {1}, {2, 3, 4}
  --> {1, 2, 3}            --> {1}, {2, 4}, {3}          {1, 4}, {2, 3}
  --> {1, 2}, {3}          --> {1}, {2}, {3, 4}          {1, 3, 4}, {2}
                           --> {1}, {2}, {3}, {4}        {1, 3}, {2, 4}
 P={1}, {2}                                              {1, 3}, {2}, {4}
  --> {1}, {2}, {3}       P={1}, {2, 3}
  --> {1}, {2, 3}          --> {1}, {2, 3}, {4}
  --> {1, 3}, {2}          --> {1}, {2, 3, 4}
                           --> {1, 4}, {2, 3}

                          P={1, 3}, {2}
                           --> {1, 3, 4}, {2}
                           --> {1, 3}, {2, 4}
                           --> {1, 3}, {2}, {4}


Figure 17.1-B: Construction of a Gray code for set partitions as an interleaving process.
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1: {1, 2, 3, 4} 1: {1}, {2}, {3}, {4} 
 2:  {1, 2, 3}, {4}             2:  {1}, {2}, {3, 4}
 3:  {1, 2}, {3}, {4}           3:  {1}, {2, 4}, {3}
 4:  {1, 2}, {3, 4}             4:  {1, 4}, {2}, {3}
 5:  {1, 2, 4}, {3}             5:  {1, 4}, {2, 3}
 6:  {1, 4}, {2}, {3}           6:  {1}, {2, 3, 4}
 7:  {1}, {2, 4}, {3}           7:  {1}, {2, 3}, {4}
 8:  {1}, {2}, {3, 4}           8:  {1, 3}, {2}, {4}
 9:  {1}, {2}, {3}, {4}         9:  {1, 3}, {2, 4}
10:  {1}, {2, 3}, {4}          10:  {1, 3, 4}, {2}
11:  {1}, {2, 3, 4}            11:  {1, 2, 3, 4}
12:  {1, 4}, {2, 3}            12:  {1, 2, 3}, {4}
13:  {1, 3, 4}, {2}            13:  {1, 2}, {3}, {4}
14:  {1, 3}, {2, 4}            14:  {1, 2}, {3, 4}
15:  {1, 3}, {2}, {4}          15:  {1, 2, 4}, {3}


Figure 17.1-C: Set partitions of S_4 = {1, 2, 3, 4} in two different minimal-change orders.


The resulting process of interleaving elements is shown in figure 17.1-B. The method is similar to Trotter's
construction for permutations, see figure 10.7-B on page 253. If we change the direction with every subset
that is to be incremented, we get the minimal-change order shown in figure 17.1-C for n = 4. The left
column is generated when starting with the forward direction in each step of the recursion, the right when
starting with the backward direction. The lists can be computed with [FXT: comb/setpart-demo.cc].


The C++ class [FXT: class setpart in comb/setpart.h] stores the list in an array of signed characters.
The stored value is negated if the element is the last in the subset. The work involved with the creation
of Z_n is proportional to \sum_{k=1}^{n} k B_k where B_k is the k-th Bell number.


The parameter xdr of the constructor determines the order in which the partitions are being created:


    class setpart
    // Set partitions of the set {1,2,3,...,n}
    // By default in minimal-change order
    {
        ulong n_;   // Number of elements of set (set = {1,2,3,...,n})
        int *p_;    // p[] contains set partitions of length 1,2,3,...,n
        int **pp_;  // pp[k] points to start of set partition k
        int *ns_;   // ns[k] Number of Sets in set partition k
        int *as_;   // element k attached At Set (0<=as[k]<=k) of set(k-1)
        int *d_;    // direction with recursion (+1 or -1)
        int *x_;    // current set partition (==pp[n])
        bool xdr_;  // whether to change direction in recursion (==> minimal-change order)
        int dr0_;   // dr0: starting direction in each recursive step:
        //   dr0=+1 ==> start with partition {{1,2,3,...,n}}
        //   dr0=-1 ==> start with partition {{1},{2},{3},...,{n}}}

    public:
        setpart(ulong n, bool xdr=true, int dr0=+1)
        {
            n_ = n;
            ulong np = (n_*(n_+1))/2;  // == \sum_{k=1}^{n}{k}
            p_ = new int[np];

            pp_ = new int *[n_+1];
            pp_[0] = 0;  // unused
            pp_[1] = p_;
            for (ulong k=2; k<=n_; ++k)  pp_[k] = pp_[k-1] + (k-1);

            ns_ = new int[n_+1];
            as_ = new int[n_+1];
            d_ = new int[n_+1];
            x_ = pp_[n_];

            init(xdr, dr0);
        }
      [--snip--]  // destructor

        bool next()  { return next_rec(n_); }
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40 

        const int* data()  const  { return x_; }

        ulong print()  const
        // Print current set partition
        // Return number of chars printed
        { return print_p(n_); }

        ulong print_p(ulong k)  const;
        void print_internal()  const;  // print internal state

    protected:
      [--snip--]  // internal methods
    };


The actual work is done by the methods next_rec() and cp_append() [FXT: comb/setpart.cc]:


    int
    setpart::cp_append(const int *src, int *dst, ulong k, ulong a)
    // Copy partition in src[0,...,k-2] to dst[0,...,k-1]
    // append element k at subset a (a>=0)
    // Return number of sets in created partition.
    {
        ulong ct = 0;
        for (ulong j=0; j<k-1; ++j)
        {
            int e = src[j];
            if ( e > 0 )  dst[j] = e;
            else
            {
                if ( a==ct )  { dst[j]=-e; ++dst; dst[j]=-k; }
                else  dst[j] = e;
                ++ct;
            }
        }
        if ( a>=ct )  { dst[k-1] = -k;  ++ct; }
        return ct;
    }

    int
    setpart::next_rec(ulong k)
    // Update partition in level k from partition in level k-1  (k<=n)
    // Return number of sets in created partition
    {
        if ( k<=1 )  return 0;  // current is last

        int d = d_[k];
        int as = as_[k] + d;
        bool ovq = ( (d>0) ? (as>ns_[k-1]) : (as<0) );
        if ( ovq )  // have to recurse
        {
            ulong ns1 = next_rec(k-1);
            if ( 0==ns1 )  return 0;

            d = ( xdr_ ? -d : dr0_ );
            d_[k] = d;

            as = ( (d>0) ? 0 : ns_[k-1] );
        }
        as_[k] = as;

        ulong ns = cp_append(pp_[k-1], pp_[k], k, as);
        ns_[k] = ns;
        return ns;
    }


The partitions are represented by an array of integers whose absolute value is ≤ n. A negative value
indicates that it is the last of the subset. The set partitions of S_4 together with their 'signed value'
representations are shown in figure 17.1-D. The array as[] contains a restricted growth string (RGS)
with the condition a_j ≤ 1 + max_{i<j}(a_i). A different sort of RGS is described in section 15.2 on page 325.


The copying is the performance bottleneck of the algorithm. Therefore only about 11 million partitions
are generated per second. An O(1) algorithm for the Gray code starting with all elements in one set is
given in [201].
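
The restricted growth strings alone already enumerate the set partitions: a_j names the subset that element j+1 is attached to. The following sketch lists all strings of length n with a_0 = 0 and a_j ≤ 1 + max_{i<j}(a_i) in lexicographic order. It is not the FXT code and does no copying of partitions; the function name and the auxiliary array of running maxima are assumptions made here for illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector<ulong> rgs_t;

// All restricted growth strings of length n in lexicographic order.
std::vector<rgs_t> all_rgs(ulong n)
{
    assert( n >= 1 );
    rgs_t a(n, 0);   // current string
    rgs_t m(n, 0);   // m[j] = max(a[0], ..., a[j])
    std::vector<rgs_t> L;
    while ( true )
    {
        L.push_back(a);
        // find the rightmost position j>0 that may still be incremented
        ulong j = n;
        while ( --j > 0 )  { if ( a[j] <= m[j-1] )  break; }
        if ( j==0 )  break;  // current string is the last one

        ++a[j];
        m[j] = ( a[j] > m[j-1] ? a[j] : m[j-1] );
        for (ulong i=j+1; i<n; ++i)  { a[i] = 0;  m[i] = m[j]; }  // reset the tail
    }
    return L;
}
```

For n = 4 this gives 15 strings starting with 0000, 0001, 0010, 0011, 0012, which is the order of the as[] column in figure 17.1-D.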
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1: asl 0000] x[ +1 +2 +3 -4 ] {1, 2, 3, 4} 

  2:   as[ 0 0 0 1 ]   x[ +1 +2 -3 -4 ]   {1, 2, 3}, {4}
  3:   as[ 0 0 1 0 ]   x[ +1 +2 -4 -3 ]   {1, 2, 4}, {3}
  4:   as[ 0 0 1 1 ]   x[ +1 -2 +3 -4 ]   {1, 2}, {3, 4}
  5:   as[ 0 0 1 2 ]   x[ +1 -2 -3 -4 ]   {1, 2}, {3}, {4}
  6:   as[ 0 1 0 0 ]   x[ +1 +3 -4 -2 ]   {1, 3, 4}, {2}
  7:   as[ 0 1 0 1 ]   x[ +1 -3 +2 -4 ]   {1, 3}, {2, 4}
  8:   as[ 0 1 0 2 ]   x[ +1 -3 -2 -4 ]   {1, 3}, {2}, {4}
  9:   as[ 0 1 1 0 ]   x[ +1 -4 +2 -3 ]   {1, 4}, {2, 3}
 10:   as[ 0 1 1 1 ]   x[ -1 +2 +3 -4 ]   {1}, {2, 3, 4}
 11:   as[ 0 1 1 2 ]   x[ -1 +2 -3 -4 ]   {1}, {2, 3}, {4}
 12:   as[ 0 1 2 0 ]   x[ +1 -4 -2 -3 ]   {1, 4}, {2}, {3}
 13:   as[ 0 1 2 1 ]   x[ -1 +2 -4 -3 ]   {1}, {2, 4}, {3}
 14:   as[ 0 1 2 2 ]   x[ -1 -2 +3 -4 ]   {1}, {2}, {3, 4}
 15:   as[ 0 1 2 3 ]   x[ -1 -2 -3 -4 ]   {1}, {2}, {3}, {4}


Figure 17.1-D: The partitions of the set S_4 = {1, 2, 3, 4} together with the internal representations:
the 'signed value' array x[] and the 'attachment' array as[].


17.2 The number of set partitions: Stirling set numbers and 
Bell numbers 


  n:    B(n)    k:  1      2      3      4      5      6      7      8      9     10
  1:       1        1
  2:       2        1      1
  3:       5        1      3      1
  4:      15        1      7      6      1
  5:      52        1     15     25     10      1
  6:     203        1     31     90     65     15      1
  7:     877        1     63    301    350    140     21      1
  8:    4140        1    127    966   1701   1050    266     28      1
  9:   21147        1    255   3025   7770   6951   2646    462     36      1
 10:  115975        1    511   9330  34105  42525  22827   5880    750     45      1


Figure 17.2-A: Stirling numbers of the second kind (Stirling set numbers) and Bell numbers.


The numbers S(n, k) of partitions of the n-set into k subsets are called the Stirling numbers of the second
kind (or Stirling set numbers), see entry A008277 in [312]. They can be computed by the relation

    S(n,k) = k S(n-1,k) + S(n-1,k-1)    (17.2-1)


which is obtained by counting the partitions in our recursive construction. In the triangular array shown
in figure 17.2-A each entry is the sum of its upper left neighbor plus k times its upper neighbor. The
figure was generated with the program [FXT: comb/stirling2-demo.cc].
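
Relation 17.2-1 translates into a few lines of code. The following sketch (function names chosen here for illustration, not taken from the FXT demo) fills the triangle and sums a row to obtain the Bell number.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef std::vector< std::vector<ulong> > table_t;

// Table S[n][k] of Stirling set numbers for 0 <= k <= n <= N, using
//   S(n,k) = k*S(n-1,k) + S(n-1,k-1)  and  S(0,0) = 1.
table_t stirling2_table(ulong N)
{
    table_t S(N+1, std::vector<ulong>(N+1, 0));
    S[0][0] = 1;
    for (ulong n=1; n<=N; ++n)
        for (ulong k=1; k<=n; ++k)
            S[n][k] = k * S[n-1][k] + S[n-1][k-1];  // S[n-1][n] is zero
    return S;
}

// Bell number B_n as the sum over row n (requires n < S.size()).
ulong bell_from_table(const table_t &S, ulong n)
{
    assert( n < S.size() );
    ulong b = 0;
    for (ulong k=0; k<=n; ++k)  b += S[n][k];
    return b;
}
```

With S = stirling2_table(10) the entry S[10][5] is 42525 and bell_from_table(S, 10) is 115975, as in the last row of figure 17.2-A.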

The sum over all elements S(n, k) of row n gives the Bell number B_n, the number of set partitions of the
n-set. The sequence starts as 1, 2, 5, 15, 52, 203, 877, ..., it is entry A000110 in [312]. The Bell numbers
can also be computed by the recursion

    B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k    (17.2-2)


As GP code: 


? N=11; v=vector(N); v[1]=1; 
? for (n=2, N, v[n]=sum(k=1, n-1, binomial (n-2,k-1)*v[k])); v 
[1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975] 
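
The same recursion in C++, as a sketch under the assumption that the values fit into an unsigned long: the row of binomial coefficients is updated in place so that no factorials are needed. The function name is chosen here for illustration.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Bell numbers B_0..B_N via  B_{n+1} = sum_{k=0..n} binomial(n,k) * B_k.
std::vector<ulong> bell_numbers(ulong N)
{
    std::vector<ulong> B(N+1, 0);
    B[0] = 1;
    std::vector<ulong> c(N+1, 0);  // c[k] = binomial(n,k) for the current n
    c[0] = 1;                      // row n=0 of Pascal's triangle
    for (ulong n=0; n<N; ++n)
    {
        ulong s = 0;
        for (ulong k=0; k<=n; ++k)  s += c[k] * B[k];
        B[n+1] = s;
        // advance the binomial row from n to n+1 (right to left, in place)
        for (ulong k=n+1; k>0; --k)  c[k] += c[k-1];
    }
    return B;
}
```

Note that the index is shifted against the GP vector above: B[0] = 1 is the number of partitions of the empty set, and B[10] = 115975.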


Another way of computing the Bell numbers is given in section 3.5.3 on page 151.




17.2.1 Generating functions 


The ordinary generating function for the Bell numbers can be given as 


oo oo k 
Y Ba" = CQ. = 140420450 41504452054... — (172.3) 
n=0 k=0 IL (1— ja) 


The exponential generating function (EGF) is 


    \exp[\exp(x) - 1] = \sum_{n=0}^{\infty} B_n \frac{x^n}{n!}        (17.2-4)


? sum(k=0,11,x^k/prod(j=1,k,1-j*x))+O(x^8) \\ OGF
1 + x + 2*x^2 + 5*x^3 + 15*x^4 + 52*x^5 + 203*x^6 + 877*x^7 + O(x^8)
? serlaplace(exp(exp(x)-1)) \\ EGF
1 + x + 2*x^2 + 5*x^3 + 15*x^4 + 52*x^5 + 203*x^6 + 877*x^7 + 4140*x^8 + ...


Dobinski's formula for the Bell numbers is [349, entry “Bell Number”]

    B_n = \frac{1}{e} \sum_{k=1}^{\infty} \frac{k^n}{k!}        (17.2-5)


The array of Stirling numbers shown in figure 17.2-A can also be computed in polynomial form by setting
B_0(x) = 1 and

    B_{n+1}(x) = x [B_n'(x) + B_n(x)]        (17.2-6)

The coefficients of B_n(x) are the Stirling numbers and B_n(1) = B_n:

? B=1; for(k=1,6, B=x*(deriv(B)+B); print(subst(B,x,1),": ",B))
1: x
2: x^2 + x
5: x^3 + 3*x^2 + x
15: x^4 + 6*x^3 + 7*x^2 + x
52: x^5 + 10*x^4 + 25*x^3 + 15*x^2 + x
203: x^6 + 15*x^5 + 65*x^4 + 90*x^3 + 31*x^2 + x


The polynomials are called Bell polynomials, see [349, entry “Bell Polynomial”].


17.2.2 Set partitions of a given type 


We say a set partition of the n-element set is of type C = [c_1, c_2, c_3, ..., c_n] if it has c_1 1-element sets,
c_2 2-element sets, c_3 3-element sets, and so on. Define

    L(z) = \sum_{k=1}^{\infty} \frac{t_k z^k}{k!}        (17.2-7a)

then we have

    \exp(L(z)) = \sum_{n=0}^{\infty} \left[ \sum_{C} Z_{n,C} \prod_{k} t_k^{c_k} \right] \frac{z^n}{n!}        (17.2-7b)

where Z_{n,C} is the number of set partitions of the n-element set with type C.


? n=8; R=O(z^(n+1));
? L=sum(k=1,n,eval(Str("t"k))*z^k/k!)+R
t1*z + 1/2*t2*z^2 + 1/6*t3*z^3 + 1/24*t4*z^4 + [...] + 1/40320*t8*z^8 + O(z^9)
? serlaplace(exp(L))
1
+ t1 *z
+ (t1^2 + t2) *z^2
+ (t1^3 + 3*t2*t1 + t3) *z^3
+ (t1^4 + 6*t2*t1^2 + 4*t3*t1 + 3*t2^2 + t4) *z^4
+ (t1^5 + 10*t2*t1^3 + 10*t3*t1^2 + 15*t1*t2^2 + 5*t1*t4 + 10*t3*t2 + t5) *z^5
+ (t1^6 + 15*t2*t1^4 + 20*t3*t1^3 + [...] + 15*t2^3 + 15*t4*t2 + 10*t3^2 + t6) *z^6
+ (t1^7 + 21*t2*t1^5 + 35*t3*t1^4 + [...] + 105*t3*t2^2 + 21*t5*t2 + 35*t4*t3 + t7) *z^7
+ (t1^8 + 28*t2*t1^6 + 56*t3*t1^5 + [...] + 28*t6*t2 + 56*t5*t3 + 35*t4^2 + t8) *z^8
+ O(z^9)
Specializations give generating functions for set partitions with certain restrictions. For example, the
EGF for the partitions without sets of size one is (set t_1 = 0 and t_k = 1 for k ≠ 1) exp(exp(z) − 1 − z),
see entry A000296 in [312]. Section 11.1.2 on page 278 gives a similar construction for the EGF for
permutations of prescribed cycle type.
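The coefficients in the expansion above can be cross-checked with the standard closed form for the number of set partitions of a given type, Z_{n,C} = n! / \prod_k ((k!)^{c_k} c_k!). This closed form is not stated in the text; it is used here as an assumption, and the function names below are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// n! for small n (n <= 20 fits into 64 bits).
std::uint64_t factorial(unsigned n)
{
    std::uint64_t r = 1;
    for (unsigned i = 2; i <= n; ++i) r *= i;
    return r;
}

// Number of set partitions of type C, where c[k-1] is the number of k-element sets.
// Assumed closed form: n! / prod_k ( (k!)^{c_k} * c_k! ), with n = sum_k k*c_k.
std::uint64_t count_of_type(const std::vector<unsigned>& c)
{
    unsigned n = 0;
    for (unsigned k = 1; k <= c.size(); ++k) n += k * c[k - 1];
    assert(n <= 20);  // keeps n! within 64 bits

    std::uint64_t den = 1;
    for (unsigned k = 1; k <= c.size(); ++k)
    {
        for (unsigned r = 0; r < c[k - 1]; ++r) den *= factorial(k);  // (k!)^{c_k}
        den *= factorial(c[k - 1]);                                   // c_k!
    }
    return factorial(n) / den;
}
```

For example, the type with two 2-element sets and one 3-element set (n = 7) corresponds to the term 105*t3*t2^2 in the coefficient of z^7.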


17.3 Restricted growth strings 


For some applications the restricted growth strings (RGS) may suffice. We give algorithms for their 
generation and describe classes of generalized RGS that contain the RGS for set partitions as a special 
case. 


17.3.1 RGS for set partitions in lexicographic order 
The C++ implementation [FXT: class setpart_rgs_lex in comb/setpart-rgs-lex.h] generates the RGS
for set partitions in lexicographic order:

    class setpart_rgs_lex
    // Set partitions of the n-set as restricted growth strings (RGS).
    // Lexicographic order.
    {
    public:
        ulong n_;    // Number of elements of set (set = {1,2,3,...,n})
        ulong *m_;   // m[k+1] = max(s[0], s[1],..., s[k]) + 1
        ulong *s_;   // RGS

    public:
        setpart_rgs_lex(ulong n)
        {
            n_ = n;
            m_ = new ulong[n_+1];
            m_[0] = ~0UL;  // sentinel m[0] = infinity
            s_ = new ulong[n_];
            first();
        }
      [--snip--]

        void first()
        {
            for (ulong k=0; k<n_; ++k)  s_[k] = 0;
            for (ulong k=1; k<=n_; ++k)  m_[k] = 1;
        }

        void last()
        {
            for (ulong k=0; k<n_; ++k)  s_[k] = k;
            for (ulong k=1; k<=n_; ++k)  m_[k] = k;
        }


The method to compute the successor resembles the one used with mixed radix counting (see section 9.1
on page 217): find the first digit that can be incremented and increment it, then set all skipped digits to
zero and adjust the array of maxima accordingly.


        bool next()
        {
            if ( m_[n_] == n_ )  return false;

            ulong k = n_;
            do  { --k; }  while ( (s_[k] + 1) > m_[k] );

            s_[k] += 1UL;
            ulong mm = m_[k];
            mm += (s_[k]>=mm);
            m_[k+1] = mm;  // == max2(m_[k], s_[k]+1)

            while ( ++k<n_ )
            {
                s_[k] = 0;
                m_[k+1] = mm;
            }

            return true;
        }


The method for the predecessor is 


        bool prev()
        {
            if ( m_[n_] == 1 )  return false;

            ulong k = n_;
            do  { --k; }  while ( s_[k]==0 );

            s_[k] -= 1;
            ulong mm = m_[k+1] = max2(m_[k], s_[k]+1);

            while ( ++k<n_ )
            {
                s_[k] = mm;  // == m[k]
                ++mm;
                m_[k+1] = mm;
            }

            return true;
        }


The rate of generation is about 157 M/s with next() and 190 M/s with prev() [FXT:
rgs-lex-demo.cc].
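As a sanity check of the successor method, the following self-contained sketch re-implements it with std::vector and counts the generated strings, which should give the Bell numbers. The names `rgs_lex_sketch` and `count_set_partitions` are hypothetical; the sketch drops the sentinel m[0] and instead relies on the early return for the last string, so it is an adaptation and not the FXT class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lexicographic generator for the RGS of set partitions (illustrative sketch).
struct rgs_lex_sketch
{
    std::vector<unsigned long> s;  // the RGS, s[0] is always 0
    std::vector<unsigned long> m;  // m[k] = max(s[0..k-1]) + 1 for k >= 1

    explicit rgs_lex_sketch(std::size_t n) : s(n, 0), m(n + 1, 1) { assert(n >= 1); }

    bool next()
    {
        const std::size_t n = s.size();
        if (m[n] == n) return false;   // current string is 0,1,2,...,n-1

        std::size_t k = n - 1;
        while (s[k] + 1 > m[k]) --k;   // rightmost digit that can grow (k >= 1 here)

        ++s[k];
        const unsigned long mm = (s[k] + 1 > m[k]) ? s[k] + 1 : m[k];
        m[k + 1] = mm;

        while (++k < n) { s[k] = 0; m[k + 1] = mm; }  // reset the tail
        return true;
    }
};

// Count all RGS of length n by stepping through them.
unsigned long count_set_partitions(std::size_t n)
{
    rgs_lex_sketch g(n);
    unsigned long ct = 1;  // the all-zero string
    while (g.next()) ++ct;
    return ct;
}
```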


17.3.2 RGS for set partitions into p parts 


array of minimal values for m[] is [ 1 1 1 2 3 ]
  1:  s[...12]  m[11123]  {1, 2, 3}, {4}, {5}
  2:  s[..1.2]  m[11223]  {1, 2, 4}, {3}, {5}
  3:  s[..112]  m[11223]  {1, 2}, {3, 4}, {5}
  4:  s[..12.]  m[11233]  {1, 2, 5}, {3}, {4}
  5:  s[..121]  m[11233]  {1, 2}, {3, 5}, {4}
  6:  s[..122]  m[11233]  {1, 2}, {3}, {4, 5}
  7:  s[.1..2]  m[12223]  {1, 3, 4}, {2}, {5}
  8:  s[.1.12]  m[12223]  {1, 3}, {2, 4}, {5}
  9:  s[.1.2.]  m[12233]  {1, 3, 5}, {2}, {4}
 10:  s[.1.21]  m[12233]  {1, 3}, {2, 5}, {4}
 11:  s[.1.22]  m[12233]  {1, 3}, {2}, {4, 5}
 12:  s[.11.2]  m[12223]  {1, 4}, {2, 3}, {5}
 13:  s[.1112]  m[12223]  {1}, {2, 3, 4}, {5}
 14:  s[.112.]  m[12233]  {1, 5}, {2, 3}, {4}
 15:  s[.1121]  m[12233]  {1}, {2, 3, 5}, {4}
 16:  s[.1122]  m[12233]  {1}, {2, 3}, {4, 5}
 17:  s[.12..]  m[12333]  {1, 4, 5}, {2}, {3}
 18:  s[.12.1]  m[12333]  {1, 4}, {2, 5}, {3}
 19:  s[.12.2]  m[12333]  {1, 4}, {2}, {3, 5}
 20:  s[.121.]  m[12333]  {1, 5}, {2, 4}, {3}
 21:  s[.1211]  m[12333]  {1}, {2, 4, 5}, {3}
 22:  s[.1212]  m[12333]  {1}, {2, 4}, {3, 5}
 23:  s[.122.]  m[12333]  {1, 5}, {2}, {3, 4}
 24:  s[.1221]  m[12333]  {1}, {2, 5}, {3, 4}
 25:  s[.1222]  m[12333]  {1}, {2}, {3, 4, 5}


Figure 17.3-A: Restricted growth strings in lexicographic order (left, dots for zeros) and array of prefix- 
maxima (middle) for the set partitions of the 5-set into 3 parts (right). 


Figure 17.3-A shows all set partitions of the 5-set into 3 parts, together with their RGSs. The list of
RGSs of the partitions of an n-set into p parts contains all length-n patterns with p letters. A pattern
is a word where the first occurrence of u precedes the first occurrence of v if u < v. That is, the list of
patterns is the list of words modulo permutations of the letters.
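The patterns can also be listed by a short recursion, which is handy for checking a generator on small cases. The following is a sketch under stated assumptions: it is a plain backtracking enumeration, not the iterative FXT method, and the names `rgs_p_parts` and `all_rgs_p_parts` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Append to 'out' all RGS of length n with exactly p distinct letters,
// in lexicographic order.  mx = number of distinct letters used in s so far.
void rgs_p_parts(std::vector<unsigned>& s, unsigned mx, unsigned n, unsigned p,
                 std::vector<std::vector<unsigned>>& out)
{
    const unsigned k = static_cast<unsigned>(s.size());
    if (k == n) { if (mx == p) out.push_back(s); return; }
    if (mx + (n - k) < p) return;  // too few positions left to reach p letters

    for (unsigned v = 0; v <= mx && v < p; ++v)  // a new letter must be the next unused one
    {
        s.push_back(v);
        rgs_p_parts(s, (v == mx) ? mx + 1 : mx, n, p, out);
        s.pop_back();
    }
}

std::vector<std::vector<unsigned>> all_rgs_p_parts(unsigned n, unsigned p)
{
    std::vector<std::vector<unsigned>> out;
    std::vector<unsigned> s;
    rgs_p_parts(s, 0, n, p, out);
    return out;
}
```

The number of strings returned for given n and p should equal the Stirling set number S(n, p).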


The restricted growth strings corresponding to set partitions into p parts can be generated with [FXT:
class setpart_p_rgs_lex in comb/setpart-p-rgs-lex.h]:

    class setpart_p_rgs_lex
    {
    public:
        ulong n_;    // Number of elements of set (set = {1,2,3,...,n})
        ulong p_;    // Exactly p subsets
        ulong *m_;   // m[k+1] = max(s[0], s[1],..., s[k]) + 1
        ulong *s_;   // RGS

    public:
        setpart_p_rgs_lex(ulong n, ulong p)
        {
            n_ = n;
            m_ = new ulong[n_+1];
            m_[0] = ~0UL;  // sentinel m[0] = infinity
            s_ = new ulong[n_];
            first(p);
        }

      [--snip--]  // destructor

        void first(ulong p)
        // Must have 2<=p<=n
        {
            for (ulong k=0; k<n_; ++k)  s_[k] = 0;
            for (ulong k=n_-p+1, j=1; k<n_; ++k, ++j)  s_[k] = j;

            for (ulong k=1; k<=n_; ++k)  m_[k] = s_[k-1]+1;
            p_ = p;
        }
The method to compute the successor also checks whether the digit is less than p and has an additional
loop to repair the rightmost digits when needed:

        bool next()
        {
            // if ( 1==p_ )  return false;  // make things work with p==1

            ulong k = n_;
            bool q;
            do
            {
                --k;
                const ulong sk1 = s_[k] + 1;
                q = (sk1 > m_[k]);  // greater max
                q |= (sk1 >= p_);   // more than p parts
            }
            while ( q );

            if ( k == 0 )  return false;

            s_[k] += 1UL;
            ulong mm = m_[k];
            mm += (s_[k]>=mm);
            m_[k+1] = mm;  // == max2(m_[k], s_[k]+1);

            while ( ++k<n_ )
            {
                s_[k] = 0;
                m_[k+1] = mm;
            }

            ulong p = p_;
            if ( mm<p )  // repair tail
            {
                do  { m_[k] = p;  --k;  --p;  s_[k] = p; }
                while ( m_[k] < p );
            }

            return true;
        }

As given the computation will fail for p = 1, the line commented out removes this limitation. The rate
of generation is about 108 M/s [FXT: comb/setpart-p-rgs-lex-demo.cc].
17.3.3 RGS for set partitions in minimal-change order


For the Gray code we need an additional array of directions, see section 9.2 for the equivalent
routines with mixed radix numbers. The implementation allows starting either with the partition into
one set or the partition into n sets [FXT: class setpart_rgs_gray in comb/setpart-rgs-gray.h]:


    class setpart_rgs_gray
    {
    public:
        ulong n_;    // Number of elements of set (set = {1,2,3,...,n})
        ulong *m_;   // m[k+1] = max(s[0], s[1],..., s[k]) + 1
        ulong *s_;   // RGS
        ulong *d_;   // direction with recursion (+1 or -1)

    public:
        setpart_rgs_gray(ulong n, int dr0=+1)
        // dr0=+1 ==> start with partition {{1,2,3,...,n}}
        // dr0=-1 ==> start with partition {{1},{2},{3},...,{n}}
        {
            n_ = n;
            m_ = new ulong[n_+1];
            m_[0] = ~0UL;  // sentinel m[0] = infinity
            s_ = new ulong[n_];
            d_ = new ulong[n_];
            first(dr0);
        }
      [--snip--]

        void first(int dr0)
        {
            const ulong n = n_;
            const ulong dd = (dr0 >= 0 ? +1UL : -1UL);
            if ( dd==1 )
            {
                for (ulong k=0; k<n; ++k)  s_[k] = 0;
                for (ulong k=1; k<=n; ++k)  m_[k] = 1;
            }
            else
            {
                for (ulong k=0; k<n; ++k)  s_[k] = k;
                for (ulong k=1; k<=n; ++k)  m_[k] = k;
            }

            for (ulong k=0; k<n; ++k)  d_[k] = dd;
        }


The method to compute the successor is

        bool next()
        {
            ulong k = n_;
            do  { --k; }  while ( (s_[k] + d_[k]) > m_[k] );  // <0 or >max

            if ( k == 0 )  return false;

            s_[k] += d_[k];
            m_[k+1] = max2(m_[k], s_[k]+1);

            while ( ++k<n_ )
            {
                const ulong d = d_[k] = -d_[k];
                const ulong mk = m_[k];
                s_[k] = ( (d==1UL) ? 0 : mk );
                m_[k+1] = mk + (d!=1UL);  // == max2(mk, s_[k]+1)
            }

            return true;
        }

The rate of generation is about 154 M/s [FXT: comb/setpart-rgs-gray-demo.cc]. It must be noted that
while the corresponding set partitions are in minimal-change order (see figure 17.1-C on page 356) the
RGS occasionally changes in more than one digit. A Gray code for the RGS for set partitions into p parts
where only one position changes with each update is described in [288].
17.3.4 Max-increment RGS ‡


The generation of RGSs s = [s_0, s_1, ..., s_{n-1}] where s_k ≤ i + max_{j<k}(s_j) is a generalization of the RGSs
for set partitions (where i = 1). Figure 17.3-B shows RGSs in lexicographic order for i = 2 (left) and
i = 1 (right). The strings can be generated in lexicographic order using the FXT class rgs_maxincr:


    class rgs_maxincr
    {
    public:
        ulong *s_;   // restricted growth string
        ulong *m_;   // m_[k-1] == max possible value for s_[k]
        ulong n_;    // Length of strings
        ulong i_;    // s[k] <= max_{j<k}(s[j]+i)
        // i==1 ==> RGS for set partitions

    public:
        rgs_maxincr(ulong n, ulong i=1)
        {
            n_ = n;
            m_ = new ulong[n_];
            s_ = new ulong[n_];
            i_ = i;
            first();
        }

        ~rgs_maxincr()
        {
            delete [] m_;
            delete [] s_;
        }

        void first()
        {
            ulong n = n_;
            for (ulong k=0; k<n; ++k)  s_[k] = 0;
            for (ulong k=0; k<n; ++k)  m_[k] = i_;
        }
      [--snip--]


The computation of the successor returns the index of the first (leftmost) changed element in the string. Zero
is returned if the current string is the last:

        ulong next()
        {
            ulong k = n_;

        start:
            --k;
            if ( k==0 )  return 0;

            ulong sk = s_[k] + 1;
            ulong m1 = m_[k-1];
            if ( sk > m1+i_ )  // "carry"
            {
                s_[k] = 0;
                goto start;
            }

            s_[k] = sk;
            if ( sk>m1 )  m1 = sk;
            for (ulong j=k; j<n_; ++j )  m_[j] = m1;

            return k;
        }
      [--snip--]


About 115 million RGSs per second are generated with the routine. Figure 17.3-B was created with
the program [FXT: comb/rgs-maxincr-demo.cc]. The sequences of numbers of max-increment RGSs with
increment i = 1, 2, 3, and 4, start

   n:  0  1  2   3    4     5      6       7        8          9          10
 i=1:  1  1  2   5   15    52    203     877     4140      21147      115975
 i=2:  1  1  3  12   59   339   2210   16033   127643    1103372    10269643
 i=3:  1  1  4  22  150  1200  10922  110844  1236326   14990380   195895202
 i=4:  1  1  5  35  305  3125  36479  475295  6811205  106170245  1784531879
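The entries of the table can be reproduced without generating the strings, by counting prefixes according to their current maximum. This is a sketch under the assumption that s_0 = 0 and s_k ≤ i + max_{j<k}(s_j) as stated above; the function name `count_maxincr` is hypothetical and not part of FXT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of length-n strings with s[0] = 0 and s[k] <= i + max_{j<k} s[j].
std::uint64_t count_maxincr(unsigned n, unsigned i)
{
    if (n == 0) return 1;                   // the empty string
    std::vector<std::uint64_t> ways(1, 1);  // ways[m] = number of prefixes with maximum m
    for (unsigned k = 1; k < n; ++k)
    {
        std::vector<std::uint64_t> nxt(ways.size() + i, 0);
        for (std::size_t m = 0; m < ways.size(); ++m)
        {
            nxt[m] += ways[m] * (m + 1);    // digits 0..m keep the maximum
            for (unsigned d = 1; d <= i; ++d)
                nxt[m + d] += ways[m];      // digit m+d raises the maximum to m+d
        }
        ways.swap(nxt);
    }
    std::uint64_t total = 0;
    for (std::uint64_t w : ways) total += w;
    return total;
}
```

With i = 1 the counts are the Bell numbers, as expected for the RGS of set partitions.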
Figure 17.3-B: Length-4 max-increment RGS with i = 2 and the corresponding array of maxima (left)
and length-5 RGSs with i = 1 (right). Dots denote zeros.
The sequence for i = 2 is entry A080337 in [312], it has the exponential generating function (EGF)

    \sum_{n=0}^{\infty} B_{n+1,2} \frac{x^n}{n!} = \exp\left[ x + \exp(x) + \frac{\exp(2x)}{2} - \frac{3}{2} \right]        (17.3-1)

The sequence of numbers of increment-3 RGSs has the EGF

    \sum_{n=0}^{\infty} B_{n+1,3} \frac{x^n}{n!} = \exp\left[ x + \exp(x) + \frac{\exp(2x)}{2} + \frac{\exp(3x)}{3} - \frac{11}{6} \right]        (17.3-2)

Omitting the empty set, we restate the EGF for the Bell numbers (relation 17.2-4 on page 359) as

    \sum_{n=0}^{\infty} B_{n+1,1} \frac{x^n}{n!} = \exp[x + \exp(x) - 1] = \frac{1}{0!} + \frac{2}{1!} x + \frac{5}{2!} x^2 + \frac{15}{3!} x^3 + \frac{52}{4!} x^4 + \ldots        (17.3-3)

The EGF for the increment-i RGS is

    \sum_{n=0}^{\infty} B_{n+1,i} \frac{x^n}{n!} = \exp\left[ x + \sum_{j=1}^{i} \frac{\exp(j x) - 1}{j} \right]        (17.3-4)


17.3.5 F-increment RGS ‡


For a different generalization of the RGS for set partitions, we rewrite the condition s_k ≤ i + max_{j<k}(s_j)
for the RGS considered in the previous section:

    s_k ≤ M(k) + i    where M(0) = 0 and        (17.3-5a)

    M(k+1) = \begin{cases} s_{k+1} & \text{if } s_{k+1} - s_k > 0 \\ M(k) & \text{otherwise} \end{cases}        (17.3-5b)

The function M(k) is max_{j<k}(s_j) in notational disguise. We define F-increment RGSs with respect to a
function F as follows:

    s_k ≤ F(k) + i    where F(0) = 0 and        (17.3-6a)

    F(k+1) = \begin{cases} s_{k+1} & \text{if } s_{k+1} - s_k = i \\ F(k) & \text{otherwise} \end{cases}        (17.3-6b)

The function F(k) is a ‘maximum’ that is increased only if the last increment (s_k − s_{k−1}) was maximal.
For i = 1 we get the RGSs for set partitions. Figure 17.3-C shows all length-4 F-increment RGSs for
i = 2 (left) and all length-3 RGSs for i = 5 (right), together with the arrays of F-values. The listings were
created with the program [FXT: comb/rgs-fincr-demo.cc] which uses the implementation [FXT: class
rgs_fincr in comb/rgs-fincr.h]:


    class rgs_fincr
    {
    public:
        ulong *s_;   // restricted growth string
        ulong *f_;   // values F(k)
        ulong n_;    // Length of strings
        ulong i_;    // s[k] <= f[k]+i
      [--snip--]

        ulong next()
        // Return index of first changed element in s[],
        // Return zero if current string is the last
        {
            ulong k = n_;

        start:
            --k;
            if ( k==0 )  return 0;

            ulong sk = s_[k] + 1;
            ulong m1 = f_[k-1];
            ulong mp = m1 + i_;

            if ( sk > mp )  // "carry"
            {
                s_[k] = 0;
                goto start;
            }

            s_[k] = sk;
            if ( sk==mp )  m1 += i_;

            for (ulong j=k; j<n_; ++j )  f_[j] = m1;
            return k;
        }
      [--snip--]

Figure 17.3-C: Length-4 F-increment restricted growth strings with maximal increment 2 and the
corresponding array of values of F (left) and length-3 RGSs with maximal increment 5 (right). Dots
denote zeros.

The sequences of numbers of F-increment RGSs with increments i = 1, 2, 3, 4, and 5, start

   n:  0  1   2    3     4      5       6        7         8           9
 i=1:  1  2   5   15    52    203     877     4140     21147      115975
 i=2:  1  3  11   49   257   1539   10299    75905    609441     5284451
 i=3:  1  4  19  109   742   5815   51193   498118   5296321    60987817
 i=4:  1  5  29  201  1657  15821  170389  2032785  26546673   376085653
 i=5:  1  6  41  331  3176  35451  447981  6282416  96546231  1611270851

These are respectively entries A000110 (Bell numbers), A004211, A004212, A004213, and A005011 in
[312]. The shown array appears in [203]. In general, the number F_{n,i} of F-increment RGSs (length n,
with increment i) is

    F_{n,i} = \sum_{k=0}^{n} i^{n-k} S(n,k)        (17.3-7)

where S(n, k) are the Stirling numbers of the second kind. The exponential generating functions are

    \sum_{n=0}^{\infty} F_{n,i} \frac{x^n}{n!} = \exp\left[ \frac{\exp(i x) - 1}{i} \right]        (17.3-8)

The ordinary generating functions are

    \sum_{n=0}^{\infty} F_{n,i} x^n = \sum_{n=0}^{\infty} \frac{x^n}{\prod_{k=1}^{n} (1 - i k x)}        (17.3-9)
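Relation (17.3-7) can be checked against a brute-force count for small arguments. The sketch below assumes the update rule used in the next() method above (F grows by i exactly when the digit equals F + i); the function names `count_fincr` and `fincr_by_formula` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Brute-force count of the length-n F-increment RGS with increment i.
// Position k may hold 0..F+i; F grows by i exactly when the digit equals F+i.
std::uint64_t count_fincr(unsigned n, unsigned i, unsigned k = 1, unsigned F = 0)
{
    if (k >= n) return 1;  // s[0] = 0 is fixed; also covers n = 0 and n = 1
    std::uint64_t ct = 0;
    for (unsigned v = 0; v <= F + i; ++v)
        ct += count_fincr(n, i, k + 1, (v == F + i) ? F + i : F);
    return ct;
}

// The same number via relation (17.3-7): sum over k of i^(n-k) * S(n,k).
std::uint64_t fincr_by_formula(unsigned n, unsigned i)
{
    assert(n <= 15);  // assumption: small arguments only, no overflow checks
    std::vector<std::uint64_t> S(n + 1, 0);  // one row of the Stirling triangle
    S[0] = 1;                                // row 0
    for (unsigned r = 1; r <= n; ++r)
    {
        // update in place from the right: S(r,k) = k*S(r-1,k) + S(r-1,k-1)
        for (unsigned k = r; k >= 1; --k) S[k] = k * S[k] + S[k - 1];
        S[0] = 0;
    }
    std::uint64_t sum = 0, pw = 1;           // pw = i^(n-k)
    for (unsigned k = n + 1; k-- > 0; )
    {
        sum += pw * S[k];
        pw *= i;
    }
    return sum;
}
```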


17.3.6 K-increment RGS ‡


  1:  [....]
  2:  [...1]    11:  [..21]    20:  [.11.]    29:  [.124]
  3:  [...2]    12:  [..22]    21:  [.111]    30:  [.125]
  4:  [...3]    13:  [..23]    22:  [.112]    31:  [.13.]
  5:  [..1.]    14:  [..24]    23:  [.113]    32:  [.131]
  6:  [..11]    15:  [..25]    24:  [.114]    33:  [.132]
  7:  [..12]    16:  [.1..]    25:  [.12.]    34:  [.133]
  8:  [..13]    17:  [.1.1]    26:  [.121]    35:  [.134]
  9:  [..14]    18:  [.1.2]    27:  [.122]    36:  [.135]
 10:  [..2.]    19:  [.1.3]    28:  [.123]    37:  [.136]


Figure 17.3-D: The 37 K-increment RGS of length 4 in lexicographic order.

We mention yet another type of restricted growth strings, the K-increment RGS, which satisfy

    s_k ≤ s_{k-1} + k        (17.3-10)

An implementation for their generation in lexicographic order is given in [FXT: comb/rgs-kincr.h]:
    class rgs_kincr
    {
    public:
        ulong *s_;   // restricted growth string
        ulong n_;    // Length of strings
      [--snip--]

        ulong next()
        // Return index of first changed element in s[],
        // Return zero if current string is the last
        {
            ulong k = n_;

        start:
            --k;
            if ( k==0 )  return 0;

            ulong sk = s_[k] + 1;
            ulong mp = s_[k-1] + k;
            if ( sk > mp )  // "carry"
            {
                s_[k] = 0;
                goto start;
            }

            s_[k] = sk;
            return k;
        }
      [--snip--]


The sequence of the numbers of K-increment RGS of length n is entry A107877 in [312]:

 n:  0  1  2  3   4    5     6      7       8        9         10
     1  1  2  7  37  268  2496  28612  391189  6230646  113521387

The strings of length 4 are shown in figure 17.3-D. They can be generated with the program [FXT:
comb/rgs-kincr-demo.cc].
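The first few terms of the sequence are easy to confirm with a direct recursion over condition (17.3-10). This is only a counting sketch with a hypothetical function name, feasible for small n; it does not replace the generator.

```cpp
#include <cassert>
#include <cstdint>

// Number of length-n strings with s[0] = 0 and s[k] <= s[k-1] + k.
// k is the position to fill next, prev the digit at position k-1.
std::uint64_t count_kincr(unsigned n, unsigned k = 1, unsigned prev = 0)
{
    if (k >= n) return 1;  // string complete (also covers n = 0 and n = 1)
    std::uint64_t ct = 0;
    for (unsigned v = 0; v <= prev + k; ++v)
        ct += count_kincr(n, k + 1, v);
    return ct;
}
```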
Chapter 18


Necklaces and Lyndon words 


A sequence that is minimal among all its cyclic rotations is called a necklace (see section 3.5.2 on page 149
for the definition in terms of equivalence classes). Necklaces with k possible values for each element are
called k-ary (or k-bead) necklaces. We restrict our attention to binary necklaces: only two values are
allowed and we represent them by 0 and 1.

Figure 18.0-A: All binary necklaces of lengths up to 8 and their periods. Dots represent zeros.


To find all length-n necklaces we can, for all binary words of length n, test whether a word is equal to
its cyclic minimum (see section 1.13 on page 29). The sequences of binary necklaces for n ≤ 8 are shown
in figure 18.0-A. As 2^n words have to be tested, this approach is inefficient for large n. Luckily there is
both a much better algorithm for generating all necklaces and a formula for their number.


Not all necklaces are created equal. Each necklace can be assigned a period that is a divisor of the length.
That period is the smallest (nonzero) cyclic shift that transforms the word into itself. The periods are
given directly to the right of each necklace in figure 18.0-A. For n prime the only periodic necklaces are those
two that contain all ones or all zeros. Aperiodic (or equivalently, period equals length) necklaces are called
Lyndon words.

For a length-n binary word x the function bit_cyclic_period(x,n) from section 1.13 on page 29 returns
the period of the word.
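The brute-force test described above can be sketched as follows. The functions are naive stand-ins written for illustration (one rotation per step); their names are hypothetical and they are not the FXT routines from section 1.13.

```cpp
#include <cassert>
#include <cstdint>

// Rotate the n-bit word x left by one position (assumes 1 <= n <= 63, x < 2^n).
inline std::uint64_t rotl1(std::uint64_t x, unsigned n)
{
    const std::uint64_t mask = (std::uint64_t(1) << n) - 1;
    return ((x << 1) | (x >> (n - 1))) & mask;
}

// Smallest nonzero cyclic shift that maps x to itself.
unsigned cyclic_period(std::uint64_t x, unsigned n)
{
    std::uint64_t y = x;
    for (unsigned p = 1; p < n; ++p)
    {
        y = rotl1(y, n);
        if (y == x) return p;
    }
    return n;
}

// Minimum over all cyclic rotations of x.
std::uint64_t cyclic_min(std::uint64_t x, unsigned n)
{
    std::uint64_t y = x, mn = x;
    for (unsigned p = 1; p < n; ++p)
    {
        y = rotl1(y, n);
        if (y < mn) mn = y;
    }
    return mn;
}

bool is_necklace(std::uint64_t x, unsigned n) { return cyclic_min(x, n) == x; }

bool is_lyndon(std::uint64_t x, unsigned n)
{
    return is_necklace(x, n) && cyclic_period(x, n) == n;
}
```

Looping x over 0, ..., 2^n − 1 and keeping the words for which is_necklace() holds gives the brute-force list; the cost grows as n·2^n, which is why the algorithms of the next section are preferable.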


18.1 Generating all necklaces 


We give several methods to generate all necklaces of a given size. An efficient algorithm for the generation
of bracelets (see section 3.5.2.4 on page 150) is given in [299].
18.1.1 The FKM algorithm 


  1:  [....]  j=1  N         1:  [......]  j=1  N
  2:  [...1]  j=4  N L       2:  [.....1]  j=6  N L
  3:  [...2]  j=4  N L       3:  [....1.]  j=5
  4:  [..1.]  j=3            4:  [....11]  j=6  N L
  5:  [..11]  j=4  N L       5:  [...1..]  j=4
  6:  [..12]  j=4  N L       6:  [...1.1]  j=6  N L
  7:  [..2.]  j=3            7:  [...11.]  j=5
  8:  [..21]  j=4  N L       8:  [...111]  j=6  N L
  9:  [..22]  j=4  N L       9:  [..1..1]  j=3  N
 10:  [.1.1]  j=2  N        10:  [..1.1.]  j=5
 11:  [.1.2]  j=4  N L      11:  [..1.11]  j=6  N L
 12:  [.11.]  j=3           12:  [..11..]  j=4
 13:  [.111]  j=4  N L      13:  [..11.1]  j=6  N L
 14:  [.112]  j=4  N L      14:  [..111.]  j=5
 15:  [.12.]  j=3           15:  [..1111]  j=6  N L
 16:  [.121]  j=4  N L      16:  [.1.1.1]  j=2  N
 17:  [.122]  j=4  N L      17:  [.1.11.]  j=5
 18:  [.2.2]  j=2  N        18:  [.1.111]  j=6  N L
 19:  [.21.]  j=3           19:  [.11.11]  j=3  N
 20:  [.211]  j=4  N L      20:  [.111.1]  j=4
 21:  [.212]  j=4  N L      21:  [.1111.]  j=5
 22:  [.22.]  j=3           22:  [.11111]  j=6  N L
 23:  [.221]  j=4  N L      23:  [111111]  j=1  N
 24:  [.222]  j=4  N L      23 (6, 2) pre-necklaces.
 25:  [1111]  j=1  N        14 necklaces and 9 Lyndon words.
 26:  [1112]  j=4  N L
 27:  [1121]  j=3
 28:  [1122]  j=4  N L
 29:  [1212]  j=2  N
 30:  [1221]  j=3
 31:  [1222]  j=4  N L
 32:  [2222]  j=1  N
 32 (4, 3) pre-necklaces.
 24 necklaces and 18 Lyndon words.


Figure 18.1-A: Ternary length-4 (left) and binary length-6 (right) pre-necklaces as generated by the 
FKM algorithm. Dots are used for zeros, necklaces are marked with ‘N’, Lyndon words with ‘L’. 


The following algorithm for generating all necklaces actually produces pre-necklaces, a subset of which
are the necklaces. A pre-necklace is a string that is the prefix of some necklace. The FKM algorithm (for
Fredericksen, Kessler, Maiorana) to generate all k-ary length-n pre-necklaces proceeds as follows:


1. Initialize the word F = [f_1, f_2, ..., f_n] to all zeros. Set j = 1.

2. (Visit pre-necklace F. If j divides n, then F is a necklace. If j equals n, then F is a Lyndon word.)

3. Find the largest index j so that f_j < k−1. If there is no such index (then F = [k−1, k−1, ..., k−1],
the last necklace), then terminate.

4. Increment f_j. Fill the suffix starting at f_{j+1} with copies of [f_1, ..., f_j]. Goto step 2.




The crucial steps are [FXT: comb/necklace-fkm-demo.cc]:

    for (ulong i=1; i<=n; ++i)  f[i] = 0;  // Initialize to zero
    bool nq = 1;  // whether pre-necklace is a necklace
    bool lq = 0;  // whether pre-necklace is a Lyndon word
    ulong j = 1;
    while ( 1 )
    {
        // Print necklace:
        cout << setw(4) << pct << ":";
        print_vec(" ", f+1, n, true);
        cout << " j=" << j;
        if ( nq )  cout << " N";
        if ( lq )  cout << " L";
        cout << endl;

        // Find largest index where we can increment:
        j = n;
        while ( f[j]==k-1 )  { --j; };

        if ( j==0 )  break;

        ++f[j];

        // Copy periodically:
        for (ulong i=1,t=j+1; t<=n; ++i,++t)  f[t] = f[i];

        nq = ( (n%j)==0 );  // necklace if j divides n
        lq = ( j==n );  // Lyndon word if j equals n
    }
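A self-contained variant of the loop above that counts instead of printing can serve as a quick check of the totals given in figure 18.1-A. The struct and function names are hypothetical; the sketch assumes k ≥ 2 so that the unused entry f[0] = 0 stops the search for an index to increment.

```cpp
#include <cassert>
#include <vector>

struct fkm_counts { unsigned long pre, neck, lyn; };

// Count k-ary length-n pre-necklaces, necklaces, and Lyndon words with the FKM steps.
fkm_counts fkm_count(unsigned n, unsigned k)
{
    assert(n >= 1 && k >= 2);
    std::vector<unsigned> f(n + 1, 0);  // word in f[1..n]; f[0] = 0 acts as a stopper
    fkm_counts c = {0, 0, 0};
    unsigned j = 1;
    while (true)
    {
        ++c.pre;                        // visit pre-necklace
        if (n % j == 0) ++c.neck;       // necklace if j divides n
        if (j == n) ++c.lyn;            // Lyndon word if j equals n

        j = n;                          // find largest index that can be incremented
        while (f[j] == k - 1) --j;
        if (j == 0) break;              // word was [k-1, ..., k-1]

        ++f[j];
        for (unsigned i = 1, t = j + 1; t <= n; ++i, ++t) f[t] = f[i];  // copy periodically
    }
    return c;
}
```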


Two example runs are shown in figure 18.1-A. An efficient implementation of the algorithm is [FXT:
class necklace in comb/necklace.h]:

    class necklace
    {
    public:
        ulong *a_;   // the string, NOTE: one-based
        ulong *dv_;  // delta sequence of divisors of n
        ulong n_;    // length of strings
        ulong m1_;   // m-ary strings, m1=m-1
        ulong j_;    // period of the word (if necklaces)

    public:
        necklace(ulong m, ulong n)
        {
            n_ = ( n ? n : 1 );  // at least 1
            m1_ = ( m>1 ? m-1 : 1 );  // at least 2
            a_ = new ulong[n_+1];
            dv_ = new ulong[n_+1];
            for (ulong j=1; j<=n; ++j)  dv_[j] = ( 0==(n_%j ) );  // divisors
            first();
        }
      [--snip--]

        void first()
        {
            for (ulong j=0; j<=n_; ++j)  a_[j] = 0;
            j_ = 1;
        }
      [--snip--]

The method to compute the next pre-necklace is

        ulong next_pre()  // next pre-necklace
        // return j (zero when finished)
        {
            // Find rightmost digit that can be incremented:
            ulong j = n_;
            while ( a_[j] == m1_ )  { --j; }

            // Increment:
            // if ( 0==j_ )  return 0;  // last
            ++a_[j];

            // Copy periodically:
            for (ulong k=j+1; k<=n_; ++k)  a_[k] = a_[k-j];

            j_ = j;
            return j;
        }


Note the commented out return with the last word, this gives a speedup (and no harm is done with the
following copying). The array dv is used to determine whether the current pre-necklace is also a necklace
(or Lyndon word) via simple lookups:

        bool is_necklace() const
        {
            return ( 0!=dv_[j_] );  // whether j divides n
        }

        bool is_lyn() const
        {
            return ( j_==n_ );  // whether j equals n
        }

The methods for the computation of the next necklace or Lyndon word are

        ulong next()  // next necklace
        {
            do
            {
                next_pre();
                if ( 0==j_ )  return 0;
            }
            while ( 0==dv_[j_] );  // until j divides n
            return j_;
        }

        ulong next_lyn()  // next Lyndon word
        {
            do
            {
                next_pre();
                if ( 0==j_ )  return 0;
            }
            while ( j_!=n_ );  // until j equals n
            return j_;  // ==n
        }
    };
The rate of generation for pre-necklaces is about 98 M/s for base 2, 140 M/s for base 3, and 180 M/s for
base 4 [FXT: comb/necklace-demo.cc]. A specialization of the algorithm for binary necklaces is [FXT:
class binary_necklace in comb/binary-necklace.h]. The rate of generation for pre-necklaces is about
128 M/s [FXT: comb/binary-necklace-demo.cc]. A version of the algorithm that produces the binary
necklaces as bits of a word is given in section 1.13.3 on page 30.

The binary necklaces of length n can be used as cycle leaders in the length-2^n zip permutation (and its
inverse) that is discussed in section 2.10 on page 125. An algorithm for the generation of all irreducible
binary polynomials via Lyndon words is described in section 40.10 on page 856.


18.1.2 Binary Lyndon words with length a Mersenne exponent 


The length-n binary Lyndon words for n an exponent of a Mersenne prime M_n = 2^n − 1 can be generated
efficiently as binary expansions of the powers of a primitive root r of M_n until the second word with just
one bit is reached. With n = 7, M_7 = 127 and the primitive root r = 3 we get the sequence shown in
figure 18.1-B. The sequence of minimal primitive roots r_n of the first Mersenne primes M_n = 2^n − 1 is
an entry in [312]:

     2:  2       17:  3      107:  3
     3:  3       19:  3      127:  43
     5:  3       31:  7      521:
     7:  3       61:  37     607:  5   <--= 5 is a primitive root of 2**607-1
    13:  17      89:  3     1279:  5



     0 :  a= ......1 =   1  ==  ......1
     1 :  a= .....11 =   3  ==  .....11
     2 :  a= ...1..1 =   9  ==  ...1..1
     3 :  a= ..11.11 =  27  ==  ..11.11
     4 :  a= 1.1...1 =  81  ==  ...11.1
     5 :  a= 111.1.. = 116  ==  ..111.1
     6 :  a= 1.1111. =  94  ==  .1.1111
     7 :  a= ..111.. =  28  ==  ....111
     8 :  a= 1.1.1.. =  84  ==  ..1.1.1
     9 :  a= 11111.1 = 125  ==  .111111
    10 :  a= 1111..1 = 121  ==  ..11111
    11 :  a= 11.11.1 = 109  ==  .11.111
    12 :  a= 1..1..1 =  73  ==  ..1..11
    13 :  a= 1.111.. =  92  ==  ..1.111
    14 :  a= ..1.11. =  22  ==  ...1.11
    15 :  a= 1....1. =  66  ==  ....1.1
    16 :  a= 1...111 =  71  ==  ...1111
    17 :  a= 1.1.11. =  86  ==  .1.1.11
    18 :  a= ....1.. =   4  ==  ......1   <--= sequence restarts
    19 :  a= ...11.. =  12  ==  .....11
    20 :  a= .1..1.. =  36  ==  ...1..1
    21 :  a= 11.11.. = 108  ==  ..11.11
    22 :  a= 1...11. =  70  ==  ...11.1
    23 :  a= 1.1..11 =  83  ==  ..111.1
    24 :  a= 1111.1. = 122  ==  .1.1111
    25 :  a= 111.... = 112  ==  ....111
    [--snip--]


Figure 18.1-B: Generation of all (18) 7-bit Lyndon words as binary representations of the powers 
modulo 127 of the primitive root 3. The right column gives the cyclic minima. Dots are used for zeros. 
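The computation behind the figure can be sketched in a few lines. The sketch assumes that 2^n − 1 is prime and that r is a primitive root modulo 2^n − 1 (neither is checked), and that n ≤ 31 so that products fit into 64 bits; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimum over all cyclic rotations of the n-bit word x.
std::uint64_t cyclic_min(std::uint64_t x, unsigned n)
{
    const std::uint64_t mask = (std::uint64_t(1) << n) - 1;
    std::uint64_t y = x, mn = x;
    for (unsigned p = 1; p < n; ++p)
    {
        y = ((y << 1) | (y >> (n - 1))) & mask;  // rotate left by one
        if (y < mn) mn = y;
    }
    return mn;
}

// Lyndon words of length n as cyclic minima of the powers of r modulo 2^n - 1.
// Stops when the second power with just one bit set is reached.
std::vector<std::uint64_t> lyndon_words_mersenne(unsigned n, std::uint64_t r)
{
    assert(n >= 2 && n <= 31);  // assumption: a*r stays below 2^62
    const std::uint64_t M = (std::uint64_t(1) << n) - 1;
    std::vector<std::uint64_t> words;
    std::uint64_t a = 1;
    do
    {
        words.push_back(cyclic_min(a, n));
        a = (a * r) % M;
    }
    while ((a & (a - 1)) != 0);  // continue until a is again a power of two
    return words;
}
```

For n = 7 and r = 3 the loop stops at the power 3^18 = 4 (mod 127), after 18 words, in agreement with the figure.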


18.1.3 A constant amortized time (CAT) algorithm 


A constant amortized time (CAT) algorithm to generate all k-ary length-n pre-necklaces is given in [95].
The crucial part of a recursive algorithm [FXT: comb/necklace-cat-demo.cc] is the function

    ulong K, N;  // K-ary pre-necklaces of length N
    ulong f[N];
    void crsms_gen(ulong n, ulong j)
    {
        if ( n > N )  visit(j);  // pre-necklace in f[1,...,N]
        else
        {
            f[n] = f[n-j];
            crsms_gen(n+1, j);

            for (ulong i=f[n-j]+1; i<K; ++i)
            {
                f[n] = i;
                crsms_gen(n+1, n);
            }
        }
    }


After initializing the array with zeros the function must be called with both arguments equal to 1. The 
routine generates about 71 million binary pre-necklaces per second. Ternary and 5-ary pre-necklaces are 
generated at a rate of about 100 and 113 million per second, respectively. 
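To make the recursion runnable on its own, it can be wrapped in a small struct that counts the visited words. The wrapper below is an illustrative sketch (the name `cat_counter` and the counting visit() are assumptions); it allocates N+1 entries so that f[0] = 0 and the word occupies f[1..N].

```cpp
#include <cassert>
#include <vector>

// Counts K-ary length-N pre-necklaces, necklaces, and Lyndon words
// using the recursion shown above.
struct cat_counter
{
    unsigned N, K;
    std::vector<unsigned> f;  // f[0] = 0, word in f[1..N]
    unsigned long pre = 0, neck = 0, lyn = 0;

    cat_counter(unsigned n, unsigned k) : N(n), K(k), f(n + 1, 0) {}

    void visit(unsigned j)
    {
        ++pre;
        if (N % j == 0) ++neck;  // necklace if j divides N
        if (j == N) ++lyn;       // Lyndon word if j equals N
    }

    void gen(unsigned n, unsigned j)
    {
        if (n > N) { visit(j); return; }

        f[n] = f[n - j];         // continue the periodic pattern
        gen(n + 1, j);

        for (unsigned i = f[n - j] + 1; i < K; ++i)
        {
            f[n] = i;            // a larger letter starts a new period of length n
            gen(n + 1, n);
        }
    }

    void run() { gen(1, 1); }
};
```

A call such as `cat_counter c(6, 2); c.run();` leaves the three totals in c.pre, c.neck, and c.lyn.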


18.1.4 An order with fewer transitions 


The following routine generates the binary pre-necklaces in the order that would be generated by
selecting valid words from the binary Gray code:

    void xgen(ulong n, ulong j, int x=+1)
    {
        if ( n > N )  visit(j);
        else
        {
            if ( -1==x )
            {
                if ( 0==f[n-j] )  { f[n] = 1;  xgen(n+1, n, -x); }
                f[n] = f[n-j];  xgen(n+1, j, +x);
            }
            else
            {
                f[n] = f[n-j];  xgen(n+1, j, +x);
                if ( 0==f[n-j] )  { f[n] = 1;  xgen(n+1, n, -x); }
            }
        }
    }

Figure 18.1-C: The 30 binary 8-bit Lyndon words in an order with few changes between successive
words. Transitions where more than one bit changes are marked with a ‘<<’.


     n:  Xn      n:  Xn      n:   Xn      n:     Xn      n:       Xn
     1:   0      7:   2     13:   95     19:   2598     25:    85449
     2:   0      8:   5     14:  163     20:   4546     26:   155431
     3:   0      9:  11     15:  290     21:   8135     27:   284886
     4:   0     10:  15     16:  479     22:  14427     28:   522292
     5:   1     11:  34     17:  859     23:  26122     29:   963237
     6:   1     12:  54     18: 1450     24:  46957     30:  1778145


Figure 18.1-D: Excess (with respect to Gray code) of the number of bits changed.


The program [FXT: comb/necklace-gray-demo.cc] computes the binary Lyndon words with the given
routine. The ordering has fewer transitions between successive words but is in general not a Gray code
(for up to 6-bit words a Gray code is generated). Figure 18.1-C shows the output with 8-bit Lyndon
words. The first $2^{\lfloor n/2 \rfloor} - 1$ Lyndon words of length n are in Gray code order. The number Xn of additional
transitions of the length-n Lyndon words is, for n ≤ 30, shown in figure 18.1-D.


18.1.5 An order with at most three changes per transition

Figure 18.1-E: The 30 binary 8-bit necklaces in an order with at most 3 changes per transition.
Transitions where more than one bit changes are marked with a ‘<<’.


An algorithm to generate necklaces in an order such that at most 3 elements change with each update
is given in [352]. The recursion can be given as (corrected and shortened) [FXT: comb/necklace-gray3-demo.cc]:


long *f;  // data in f[1..m], f[0] = 0
long N;   // word length
int k;    // k-ary necklaces, k==sigma in the paper

void gen3(int z, int t, int j)
{
    if ( t > N )  { visit(j); }
    else
    {
        if ( (z&1)==0 )  // z (number of elements ==(k-1)) is even?
        {
            for (int i=f[t-j]; i<=k-1; ++i)
            {
                f[t] = i;
                gen3( z+(i!=k-1), t+1, (i!=f[t-j]?t:j) );
            }
        }
        else
        {
            for (int i=k-1; i>=f[t-j]; --i)
            {
                f[t] = i;
                gen3( z+(i!=k-1), t+1, (i!=f[t-j]?t:j) );
            }
        }
    }
}


  n:   Xn     n:   Xn     n:    Xn     n:      Xn     n:       Xn
  1:    0     7:    6    13:   200    19:    6462    25:   239008
  2:    1     8:   12    14:   360    20:   11722    26:   441370
  3:    2     9:   20    15:   628    21:   21234    27:   816604
  4:    2    10:   38    16:  1128    22:   38754    28:  1515716
  5:    2    11:   64    17:  1998    23:   70770    29:  2818928
  6:    4    12:  116    18:  3606    24:  129970    30:  5256628


Figure 18.1-F: Excess (with respect to Gray code) of number of bits changed.


The variable z counts the number of maximal elements. The output with length-8 binary necklaces is
shown in figure 18.1-E. Selecting the necklaces from the reversed list of complemented Gray codes of the
n-bit binary words produces the same list.


18.1.6 Binary necklaces of length 2^n via Gray-cycle leaders ‡


16 cycles of length= 8 L= 1..1.11. [ 1.11. ] 
Le A E ] ==> 18,111.  [ Lit] 
Es Lisa 1 L[.1111111 ] ==> 1.11.14 [dss t.t J 
LS di Ll. [.1:1.4. J --» 111.1.1. [ 11....1. ] 
LS docu 114 [ 11.1.1.1 ] ==> 1..11111. Eo dde dl] 
B= A PO O E J E A AA 
b= de... 1.1 [ 2.11..11 J sc» 3.111... ^. [D 21.11... J 
L= 14....11. [ 111..11. ] sez» Miles Lol i J 
L= 1....111 [ ...11..1] 

Ls dod... L[vtllece ] L= 1..1.111 [ 111.1..1 ] 
L= 1..1...1 [1...1111 ] --» 11.111.. [ 1111.1.. ] 
L= 1..1..1. [ 11.11.1. ] --» 1.11..1. [ .1111.1. ] 
L= testadi  [ 2 :1.:1.1 J --» 111.1.11 [ 1111.1 ] 
L=- 41..3.1.. D d.1111.. ] ==> 1..1111. -[ 1..111i. J 
b= 41..1.1.1 T ....11 ] ==> 11.1...1 [ .1..1111 ] 
L= 1..1.11. 1 ...1.11. ] sc» 1.111..1 -L 1.1... 111] 
L= 1..1.111 [ 111.1..1 ] --» 111..1.1 1[ 11.1..11 ] 


Figure 18.1-G: Left: the cycle leaders (minima) L of the Gray permutation with highest bit at index 7 
and their bit-wise Reed-Muller transforms Y (L). Right: the last two cycles and the transforms of their 
elements. 


The algorithm for the generation of cycle leaders for the Gray permutation given in section 2.12.1 on page 128
and relation 1.19-10c on page 53, written as

$$S_k\,Y = Y\,g^k \tag{18.1-1}$$

(Y is the yellow code, the bit-wise Reed-Muller transform) can be used for generating the necklaces of
length $2^n$: the cyclic shifts of $Y x$ are equal to $Y g^k x$ for k = 0, ..., l − 1 where l is the cycle length.
Figure 18.1-G shows the correspondence between cycles of the Gray permutation and cyclic shifts. It was
generated with the program [FXT: comb/necklaces-via-gray-leaders-demo.cc].


If no better algorithm for the cycle leaders of the Gray permutation was known, we could generate them as
$Y^{-1}(N) = Y(N)$ where N are the necklaces of length $2^n$. The same idea, together with relation 1.19-11b
on page 53, gives the relation
5,Ba = Beta (18.1-2) 


where B is the blue code and e the reversed Gray code. 


18.1.7 Binary necklaces via cyclic shifts and complements ‡


n=83 n=6 n=7 n=8 [n=8 cont. ] 
S uu qe. as 1 l3. ud ides 1 13 llis 1 19: .11 
2 .11 2t ii LL DEE e 11 P MENT 11 207 2e 1.1 
3 111 3t mes LLL 32 oe KIL Bie ues 111 21: oe. L.11 
4: ..1111 4: ..1111 4: — ....1111 22:.. «ex det 
n=4 b: .11111 5: .11111 b: ...11111 23:  ..1.1111 
le” gaat 6: 111111 6: 111111 6: ..111111 24: .1.11111 
25 sall T2. sold 7: 1111111 T: .1111111 25: .1.11.1 
3: .111 8: .11.11 8: ..111.1 8: 11111111 26: 1.11.11 
4: 1111 OF: oe ed 98) nce htt 9: ..1111.1 2T: 1.1.1 
bs. Vio 105 .-..1.41 10: ..11.11 10 111.1 28: 1.1.11 
11:  .1.111 11:  .11.111 11: ..111.11 29: 1.1.111 
n=5 125 ..:1.1.1 12*- icd 12: .111.111 30: 1.1.1.1 
d cation: 135. elie T35- sud $35 ache ded 31: E aS 
25 dl 14: ..1.111 14: sek 1 TA 32: 1..11 
3: ..111 15: 1.1111 15: ..11.111 33: .111 
4:  .1111 16*. 1.1.1 16: .11.1111 34: 1..1.1 
b: 11111 AT: .1.1.11 17:  ..11.1.1 3b: 1:4 
6: ..1.1 18: ...1..1 18: ..11..1 
Tz «1.11 19: doc 


Figure 18.1-H: Nonzero binary necklaces of lengths n = 3, 4, ..., 8 as generated by the shift and
complement algorithm.


A recursive algorithm to generate all nonzero binary necklaces via cyclic shifts and complements of the
lowest bit is described in [287]. An implementation of the method is given in
[FXT: comb/necklace-sigma-tau-demo.cc]:


inline ulong sigma(ulong x)  { return bit_rotate_left(x, 1, n); }
inline ulong tau(ulong x)  { return x ^ 1; }

void search(ulong y)
{
    visit(y);
    ulong t = y;
    while ( 1 )
    {
        t = sigma(t);
        ulong x = tau(t);
        if ( (x&1) && (x == bit_cyclic_min(x, n)) )  search(x);
        else  break;
    }
}


The initial call is search(1). The generated ordering for lengths n = 3, 4, ..., 8 is shown in figure 18.1-H.
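A self-contained sketch of the same search follows. The FXT helpers for bit rotation and the cyclic minimum are replaced by simple stand-ins (`rotate_left_1` and `cyclic_min` are names chosen here, not the library functions), the word length is passed as a parameter, and the visited words are collected in a vector instead of being printed.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Rotate the low n bits of x left by one position (stand-in for sigma).
static ulong rotate_left_1(ulong x, ulong n)
{
    const ulong mask = (1UL << n) - 1;
    return ((x << 1) | (x >> (n-1))) & mask;
}

// Smallest value among the n cyclic rotations of x.
static ulong cyclic_min(ulong x, ulong n)
{
    ulong m = x;
    for (ulong k = 1; k < n; ++k)
    {
        x = rotate_left_1(x, n);
        if ( x < m )  m = x;
    }
    return m;
}

// The recursion from the text; out collects the visited words in order.
static void search(ulong y, ulong n, std::vector<ulong> &out)
{
    out.push_back(y);  // visit
    ulong t = y;
    while ( true )
    {
        t = rotate_left_1(t, n);  // sigma
        const ulong x = t ^ 1UL;  // tau: complement lowest bit
        if ( (x & 1UL) && (x == cyclic_min(x, n)) )  search(x, n, out);
        else  break;
    }
}

// All nonzero binary necklaces of length n, in the order of the search.
std::vector<ulong> nonzero_necklaces(ulong n)
{
    assert( n >= 1 && n < 32 );
    std::vector<ulong> out;
    search(1UL, n, out);
    return out;
}
```

For n = 4 the words come out as 0001, 0011, 0111, 1111, 0101, which is the n = 4 column of figure 18.1-H.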


18.2 Lex-min De Bruijn sequence from necklaces 


The lexicographically minimal De Bruijn sequence can be obtained from the necklaces in lexicographic
order as shown in figure 18.2-A. Let W be a necklace with period p, and define its primitive part P(W)
to be the p rightmost digits of W. Then the lex-min De Bruijn sequence is the concatenation of the
primitive parts of the necklaces in lex order.
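The concatenation can be sketched directly on top of the pre-necklace recursion of section 18.1.3. This is an illustration only, not the FXT class: `lexmin_debruijn` and `fkm` are hypothetical names, and digits are limited to 0..9 so that the result fits into a string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Generate pre-necklaces a[1..n] in lex order; whenever the word is a
// necklace (p divides n) append its primitive part a[1..p] to out.
static void fkm(std::vector<unsigned> &a, unsigned m,
                std::size_t t, std::size_t p, std::string &out)
{
    const std::size_t n = a.size() - 1;
    if ( t > n )
    {
        if ( n % p == 0 )
            for (std::size_t i = 1; i <= p; ++i)  out += char('0' + a[i]);
        return;
    }
    a[t] = a[t-p];
    fkm(a, m, t+1, p, out);
    for (unsigned d = a[t-p] + 1; d < m; ++d)
    {
        a[t] = d;
        fkm(a, m, t+1, t, out);
    }
}

// Lex-min m-ary De Bruijn sequence of length m^n as a digit string.
std::string lexmin_debruijn(unsigned m, std::size_t n)
{
    assert( m >= 2 && m <= 10 && n >= 1 );
    std::vector<unsigned> a(n + 1, 0);  // one-based word, a[0] = 0
    std::string out;
    fkm(a, m, 1, 1, out);
    return out;
}
```

With m = 3 and n = 4 the returned string should agree with the 81-digit sequence at the bottom of figure 18.2-A.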


neckl.  period  P(neckl.)
 0000     1     0
 0001     4     0001
 0002     4     0002
 0011     4     0011
 0012     4     0012
 0021     4     0021
 0022     4     0022
 0101     2     01
 0102     4     0102
 0111     4     0111
 0112     4     0112
 0121     4     0121
 0122     4     0122
 0202     2     02
 0211     4     0211
 0212     4     0212
 0221     4     0221
 0222     4     0222
 1111     1     1
 1112     4     1112
 1122     4     1122
 1212     2     12
 1222     4     1222
 2222     1     2

0 0001 0002 0011 0012 0021 0022 01 0102 0111 0112 [--snip--] 1122 12 1222 2 ==
000010002001100120021002201010201110112012101220202110212022102221111211221212222


Figure 18.2-A: The 3-ary necklaces of length 4 (left) and their primitive parts (right). The concatenation
of the primitive parts gives a De Bruijn sequence (bottom).


An implementation is [FXT: class debruijn in comb/debruijn.h]:


class debruijn : public necklace
// Lexicographic minimal De Bruijn sequence.
{
public:
    ulong i_;  // position of current digit in current string

public:
    debruijn(ulong m, ulong n)
        : necklace(m, n)
    { first_string(); }

    ~debruijn()  { ; }

    ulong first_string()
    {
        necklace::first();
        i_ = 1;
        return j_;
    }

    ulong next_string()  // make new string, return its length
    {
        necklace::next();
        i_ = (j_ != 0);
        return j_;
    }

    ulong next_digit()
    // Return current digit and move to next digit.
    // Return m if previous was last.
    {
        if ( i_ == 0 )  return necklace::m1_ + 1;
        ulong d = a_[ i_ ];
        if ( i_ == j_ )  next_string();
        else  ++i_;
        return d;
    }

    ulong first_digit()
    {
        first_string();
        return next_digit();
    }
};


Usage is demonstrated in [FXT: comb/debruijn-demo.cc]:


    ulong m = 3;  // m-ary De Bruijn sequence
    ulong n = 4;  // length = m**n
    debruijn S(m, n);
    ulong i = S.first_string();
    do
    {
        cout << " ";
        for (ulong u=1; u<=i; ++u)  cout << S.a_[u];  // note: one-based array
        i = S.next_string();
    }
    while ( i );


For digit by digit generation, use

    ulong i = S.first_digit();
    do
    {
        cout << i;
        i = S.next_digit();
    }
    while ( i!=m );

A special version for binary necklaces is [FXT: class binary_debruijn in comb/binary-debruijn.h].


18.3 The number of binary necklaces 


  n:   Nn      n:     Nn      n:        Nn      n:           Nn
  1:    2     11:    188     21:     99880     31:     69273668
  2:    3     12:    352     22:    190746     32:    134219796
  3:    4     13:    632     23:    364724     33:    260301176
  4:    6     14:   1182     24:    699252     34:    505294128
  5:    8     15:   2192     25:   1342184     35:    981706832
  6:   14     16:   4116     26:   2581428     36:   1908881900
  7:   20     17:   7712     27:   4971068     37:   3714566312
  8:   36     18:  14602     28:   9587580     38:   7233642930
  9:   60     19:  27596     29:  18512792     39:  14096303344
 10:  108     20:  52488     30:  35792568     40:  27487816992

Figure 18.3-A: The number of binary necklaces for n ≤ 40.

  n:   Ln      n:     Ln      n:        Ln      n:           Ln
  1:    2     11:    186     21:     99858     31:     69273666
  2:    1     12:    335     22:    190557     32:    134215680
  3:    2     13:    630     23:    364722     33:    260300986
  4:    3     14:   1161     24:    698870     34:    505286415
  5:    6     15:   2182     25:   1342176     35:    981706806
  6:    9     16:   4080     26:   2580795     36:   1908866960
  7:   18     17:   7710     27:   4971008     37:   3714566310
  8:   30     18:  14532     28:   9586395     38:   7233615333
  9:   56     19:  27594     29:  18512790     39:  14096302710
 10:   99     20:  52377     30:  35790267     40:  27487764474


Figure 18.3-B: The number of binary Lyndon words for n ≤ 40.

The number of binary necklaces of length n equals

$$N_n = \frac{1}{n}\sum_{d \mid n} \varphi(d)\,2^{n/d} = \frac{1}{n}\sum_{j=1}^{n} 2^{\gcd(j,n)} \tag{18.3-1}$$

The values for n ≤ 40 are shown in figure 18.3-A. The sequence is entry A000031 in [312].


The number of Lyndon words (aperiodic necklaces) equals

$$L_n = \frac{1}{n}\sum_{d \mid n} \mu(d)\,2^{n/d} = \frac{1}{n}\sum_{d \mid n} \mu(n/d)\,2^{d} \tag{18.3-2}$$

The Möbius function μ is defined in relation 37.1-6 on page 705. The values for n ≤ 40 are given in figure
18.3-B. The sequence is entry A001037 in [312]. Replacing 2 by k in the formulas for N_n and L_n gives
expressions for k-ary necklaces and Lyndon words.
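The two divisor sums can be evaluated with a few lines of C++. The sketch below is illustrative (the function names are not from FXT), uses plain trial division for φ and μ, and assumes that k^n fits into 64 bits.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;

// Euler's totient function by trial division.
static u64 euler_phi(u64 n)
{
    u64 r = n;
    for (u64 p = 2; p*p <= n; ++p)
        if ( n % p == 0 )
        {
            while ( n % p == 0 )  n /= p;
            r -= r / p;
        }
    if ( n > 1 )  r -= r / n;
    return r;
}

// Moebius function: 0 if n has a squared prime factor, else (-1)^(#primes).
static int moebius_mu(u64 n)
{
    int mu = 1;
    for (u64 p = 2; p*p <= n; ++p)
        if ( n % p == 0 )
        {
            n /= p;
            if ( n % p == 0 )  return 0;
            mu = -mu;
        }
    if ( n > 1 )  mu = -mu;
    return mu;
}

static u64 ipow(u64 b, u64 e)  { u64 r = 1;  while ( e-- )  r *= b;  return r; }

// Number of k-ary necklaces of length n (relation 18.3-1 with 2 replaced by k).
u64 num_necklaces(u64 k, u64 n)
{
    assert( n >= 1 );
    u64 s = 0;
    for (u64 d = 1; d <= n; ++d)
        if ( n % d == 0 )  s += euler_phi(d) * ipow(k, n/d);
    return s / n;
}

// Number of k-ary Lyndon words of length n (relation 18.3-2 with 2 replaced by k).
u64 num_lyndon(u64 k, u64 n)
{
    assert( n >= 1 );
    std::int64_t s = 0;
    for (u64 d = 1; d <= n; ++d)
        if ( n % d == 0 )  s += moebius_mu(d) * (std::int64_t)ipow(k, n/d);
    return (u64)s / n;
}
```

Calling `num_necklaces(2, n)` and `num_lyndon(2, n)` for n = 1, ..., 40 should reproduce figures 18.3-A and 18.3-B.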


For prime n = p we have L_p = N_p − 2 and

$$L_p = \frac{2^p - 2}{p} = \frac{1}{p}\sum_{k=1}^{p-1}\binom{p}{k} \tag{18.3-3}$$

The latter form tells us that there are exactly $\binom{p}{k}/p$ Lyndon words with k ones for 1 ≤ k ≤ p − 1. The
difference of 2 is due to the necklaces that consist of all zeros or ones. The number of irreducible binary
polynomials (see section 40.6 on page 843) of degree n also equals L_n. For the equivalence between
necklaces and irreducible polynomials see section 40.10 on page 856.


Let d be a divisor of n. There are $2^n$ binary words of length n, each having some period d that divides
n. There are d different shifts of the corresponding word, thereby

$$2^n = \sum_{d \mid n} d\,L_d \tag{18.3-4}$$

Möbius inversion gives relation 18.3-2. The necklaces of length n and period d are concatenations of
n/d copies of a Lyndon word of length d, so

$$N_n = \sum_{d \mid n} L_d \tag{18.3-5}$$

We note the relations (see section 37.2 on page 709)

$$(1 - 2z) = \prod_{k=1}^{\infty} \left(1 - z^k\right)^{L_k} \tag{18.3-6a}$$

$$\sum_{k=1}^{\infty} N_k\,z^k = \sum_{k=1}^{\infty} \frac{-\varphi(k)}{k}\,\log\left(1 - 2 z^k\right) \tag{18.3-6b}$$

Defining

$$\eta_B(x) := \prod_{k=1}^{\infty} \left(1 - B x^k\right) \tag{18.3-7a}$$

we have

$$\eta_2(x) = \prod_{k=1}^{\infty} \left(1 - x^k\right)^{N_k} \tag{18.3-7b}$$

$$\eta_2(x) = \prod_{k=1}^{\infty} \eta_1(x^k)^{L_k} \tag{18.3-7c}$$



  n:    N_n | z=0    1    2    3    4    5    6    7    8    9   10
  1:      2 |   1    1
  2:      3 |   1    1    1
  3:      4 |   1    1    1    1
  4:      6 |   1    1    2    1    1
  5:      8 |   1    1    2    2    1    1
  6:     14 |   1    1    3    4    3    1    1
  7:     20 |   1    1    3    5    5    3    1    1
  8:     36 |   1    1    4    7   10    7    4    1    1
  9:     60 |   1    1    4   10   14   14   10    4    1    1
 10:    108 |   1    1    5   12   22   26   22   12    5    1    1
 11:    188 |   1    1    5   15   30   42   42   30   15    5    1
 12:    352 |   1    1    6   19   43   66   80   66   43   19    6
 13:    632 |   1    1    6   22   55   99  132  132   99   55   22
 14:   1182 |   1    1    7   26   73  143  217  246  217  143   73
 15:   2192 |   1    1    7   31   91  201  335  429  429  335  201
 16:   4116 |   1    1    8   35  116  273  504  715  810  715  504
 17:   7712 |   1    1    8   40  140  364  728 1144 1430 1430 1144
 18:  14602 |   1    1    9   46  172  476 1038 1768 2438 2704 2438
 19:  27596 |   1    1    9   51  204  612 1428 2652 3978 4862 4862
 20:  52488 |   1    1   10   57  245  776 1944 3876 6310 8398 9252

Figure 18.3-C: The number N(n,z) of binary necklaces of length n with z zeros (columns z = 0, ..., 10).

  n:    L_n | z=0    1    2    3    4    5    6    7    8    9   10
  1:      2 |   1    1
  2:      1 |   0    1    0
  3:      2 |   0    1    1    0
  4:      3 |   0    1    1    1    0
  5:      6 |   0    1    2    2    1    0
  6:      9 |   0    1    2    3    2    1    0
  7:     18 |   0    1    3    5    5    3    1    0
  8:     30 |   0    1    3    7    8    7    3    1    0
  9:     56 |   0    1    4    9   14   14    9    4    1    0
 10:     99 |   0    1    4   12   20   25   20   12    4    1    0
 11:    186 |   0    1    5   15   30   42   42   30   15    5    1
 12:    335 |   0    1    5   18   40   66   75   66   40   18    5
 13:    630 |   0    1    6   22   55   99  132  132   99   55   22
 14:   1161 |   0    1    6   26   70  143  212  245  212  143   70
 15:   2182 |   0    1    7   30   91  200  333  429  429  333  200
 16:   4080 |   0    1    7   35  112  273  497  715  800  715  497
 17:   7710 |   0    1    8   40  140  364  728 1144 1430 1430 1144
 18:  14532 |   0    1    8   45  168  476 1026 1768 2424 2700 2424
 19:  27594 |   0    1    9   51  204  612 1428 2652 3978 4862 4862
 20:  52377 |   0    1    9   57  240  775 1932 3876 6288 8398 9225


Figure 18.3-D: The number L(n,z) of binary Lyndon words of length n with z zeros (columns z = 0, ..., 10).

18.3.1 Binary necklaces with fixed density


Let $N_{(n,n_0)}$ be the number of binary length-n necklaces with exactly $n_0$ zeros (and $n_1 = n - n_0$ ones), the
necklaces with fixed density. We have

$$N_{(n,n_0)} = \frac{1}{n}\sum_{j \mid \gcd(n,n_0)} \varphi(j)\,\binom{n/j}{n_0/j} \tag{18.3-8}$$

Bit-wise complementing gives the symmetry relation $N_{(n,n_0)} = N_{(n,n-n_0)} = N_{(n,n_1)}$. A table of small
values is given in figure 18.3-C.


Let $L_{(n,n_0)}$ be the number of binary length-n Lyndon words with exactly $n_0$ zeros (Lyndon words with
fixed density), then

$$L_{(n,n_0)} = \frac{1}{n}\sum_{j \mid \gcd(n,n_0)} \mu(j)\,\binom{n/j}{n_0/j} \tag{18.3-9}$$

The symmetry relation is the same as for $N_{(n,n_0)}$. A table of small values is given in figure 18.3-D.
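Relations 18.3-8 and 18.3-9 translate into short routines. The following sketch uses names chosen here for illustration (`necklaces_fixed_density`, `lyndon_fixed_density`) and assumes that the binomial coefficients fit into 64 bits.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;

static u64 gcd_u64(u64 a, u64 b)  { while ( b )  { u64 t = a % b;  a = b;  b = t; }  return a; }

// Binomial coefficient; each intermediate product is divisible by i.
static u64 binomial(u64 n, u64 k)
{
    if ( k > n )  return 0;
    if ( k > n - k )  k = n - k;
    u64 r = 1;
    for (u64 i = 1; i <= k; ++i)  r = r * (n - k + i) / i;
    return r;
}

static u64 euler_phi(u64 n)
{
    u64 r = n;
    for (u64 p = 2; p*p <= n; ++p)
        if ( n % p == 0 )  { while ( n % p == 0 )  n /= p;  r -= r / p; }
    if ( n > 1 )  r -= r / n;
    return r;
}

static int moebius_mu(u64 n)
{
    int mu = 1;
    for (u64 p = 2; p*p <= n; ++p)
        if ( n % p == 0 )  { n /= p;  if ( n % p == 0 )  return 0;  mu = -mu; }
    if ( n > 1 )  mu = -mu;
    return mu;
}

// Relation 18.3-8: binary necklaces of length n with n0 zeros.
u64 necklaces_fixed_density(u64 n, u64 n0)
{
    assert( n >= 1 && n0 <= n );
    const u64 g = gcd_u64(n, n0);  // gcd(n,0) = n
    u64 s = 0;
    for (u64 j = 1; j <= g; ++j)
        if ( g % j == 0 )  s += euler_phi(j) * binomial(n/j, n0/j);
    return s / n;
}

// Relation 18.3-9: binary Lyndon words of length n with n0 zeros.
u64 lyndon_fixed_density(u64 n, u64 n0)
{
    assert( n >= 1 && n0 <= n );
    const u64 g = gcd_u64(n, n0);
    std::int64_t s = 0;
    for (u64 j = 1; j <= g; ++j)
        if ( g % j == 0 )  s += moebius_mu(j) * (std::int64_t)binomial(n/j, n0/j);
    return (u64)s / n;
}
```

The values for n ≤ 20 can be compared against figures 18.3-C and 18.3-D, for example the entries 10 and 8 for n = 8 with four zeros.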


18.3.2 Binary necklaces with even or odd weight 


Summing $N_{(n,k)}$ over all even or odd k ≤ n gives the number of necklaces of even (symbol E_n) or odd
(O_n) weight, respectively. The first few values, the differences E_n − O_n, and the sums E_n + O_n = N_n:


        n:   1  2  3  4  5  6   7   8   9   10   11   12   13    14    15    16    17
      E_n:   1  2  2  4  4  8  10  20  30   56   94  180  316   596  1096  2068  3856
      O_n:   1  1  2  2  4  6  10  16  30   52   94  172  316   586  1096  2048  3856
E_n - O_n:   0  1  0  2  0  2   0   4   0    4    0    8    0    10     0    20     0
E_n + O_n:   2  3  4  6  8 14  20  36  60  108  188  352  632  1182  2192  4116  7712


The number of Lyndon words of even (e_n) and odd (o_n) weight can be computed in the same way:


        n:   1  2  3  4  5  6   7   8   9   10   11   12   13    14    15    16    17
      e_n:   0  0  1  1  3  4   9  14  28   48   93  165  315   576  1091  2032  3855
      o_n:   1  1  1  2  3  5   9  16  28   51   93  170  315   585  1091  2048  3855
e_n - o_n:  -1 -1  0 -1  0 -1   0  -2   0   -3    0   -5    0    -9     0   -16     0
e_n + o_n:   1  1  2  3  6  9  18  30  56   99  186  335  630  1161  2182  4080  7710


The differences between the number of necklaces and Lyndon words are:


        n:   1  2  3  4  5  6   7   8   9   10   11   12   13    14    15    16    17
E_n - e_n:   1  2  1  3  1  4   1   6   2    8    1   15    1    20     5    36     1
O_n - o_n:   0  0  1  0  1  1   1   0   2    1    1    2    1     1     5     0     1
E_n - o_n:   0  1  1  2  1  3   1   4   2    5    1   10    1    11     5    20     1
O_n - e_n:   1  1  1  1  1  2   1   2   2    4    1    7    1    10     5    16     1


18.3.3 Necklaces with fixed content 


Let $N_{(n_0,n_1,\ldots,n_{k-1})}$ be the number of k-symbol length-n necklaces with $n_j$ occurrences of symbol j, the
number of such necklaces with fixed content. With $n = \sum_{j=0}^{k-1} n_j$ we have

$$N_{(n_0,n_1,\ldots,n_{k-1})} = \frac{1}{n}\sum_{d \mid g} \varphi(d)\,\frac{(n/d)!}{(n_0/d)!\cdots(n_{k-1}/d)!} \tag{18.3-10}$$

where $g = \gcd(n_0, n_1, \ldots, n_{k-1})$. The equivalent formula for the Lyndon words with fixed content is

$$L_{(n_0,n_1,\ldots,n_{k-1})} = \frac{1}{n}\sum_{d \mid g} \mu(d)\,\frac{(n/d)!}{(n_0/d)!\cdots(n_{k-1}/d)!} \tag{18.3-11}$$

where $g = \gcd(n_0, n_1, \ldots, n_{k-1})$. The relations are taken from [289] and [300], which also give efficient
algorithms for the generation of necklaces and Lyndon words with fixed density and content, respectively.


The number of strings with fixed content is a multinomial coefficient, see relation 13.2-1a on page 296.


A method for the generation of all necklaces with forbidden substrings is given in [290]. 


18.4 Sums of roots of unity that are zero ‡


       bitstring   period   subset
  1:  ............     1    (empty sum)
  2:  .....1.....1     6    0 6
  3:  ....11....11     6    0 1 6 7
  4:  ...1...1...1     4    0 4 8    (cyclic shifts are 1 5 9, 2 6 10, 3 7 11)
  5:  ...1.1...1.1     6    0 2 6 8
  6:  ...11..1..11    12 L  0 1 4 7 8    (Lyndon word)
  7:  ...111...111     6    0 1 2 6 7 8
  8:  ..1..1..1..1     3    0 3 6 9
  9:  ..1.11..1.11     6    0 1 3 6 7 9
 10:  ..11..11..11     4    0 1 4 5 8 9
 11:  ..11.1..11.1     6    0 2 3 6 8 9
 12:  ..11.11..111    12 L  0 1 2 5 6 8 9    (Lyndon word)
 13:  ..1111..1111     6    0 1 2 3 6 7 8 9
 14:  .1.1.1.1.1.1     2    0 2 4 6 8 10
 15:  .1.111.1.111     6    0 1 2 4 6 7 8 10
 16:  .11.11.11.11     3    0 1 3 4 6 7 9 10
 17:  .111.111.111     4    0 1 2 4 5 6 8 9 10
 18:  .11111.11111     6    0 1 2 3 4 6 7 8 9 10
 19:  111111111111     1    0 1 2 3 4 5 6 7 8 9 10 11    (all roots of unity)


Figure 18.4-A: All subsets of the 12-th roots of unity that add to zero, modulo cyclic shifts.


Let $\omega = \exp(2\pi i/n)$ be a primitive n-th root of unity and S be a subset of the set of n elements. We
compute all S such that $\sigma_S = 0$ where $\sigma_S := \sum_{e \in S} \omega^e$ [FXT: comb/root-sums-demo.cc]. If $\sigma_S = 0$ then
$\omega^k \sigma_S = 0$ for all k, so we can ignore cyclic shifts, see figure 18.4-A. For n prime only the empty set
and all roots of unity add to zero (no proper subset of all roots can add to zero: ω would be a root of a
polynomial that has the cyclotomic polynomial $Y_n = 1 + z + \ldots + z^{n-1}$ as divisor, which is impossible).


All necklaces that are not Lyndon words correspond to a zero sum. The smallest nontrivial cases where
Lyndon words lead to zero sums occur for n = 12 (marked with ‘L’ in figure 18.4-A).


Sequence A164896 in [312] gives the number of subsets adding to zero (modulo cyclic shifts), sequence
A110981 the number of subsets that are Lyndon words and A103314 the number of subsets where cyclic
shifts are considered as different.
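A brute-force version of this computation is sketched below. It is not the FXT program: it tests all 2^n subsets with floating point arithmetic, and both the tolerance `eps` and the names (`zero_sum_counts`, `count_zero_sums`) are assumptions made for the illustration, so it is only suitable for small n.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef unsigned long ulong;

struct zero_sum_counts
{
    ulong all;         // subsets with sum zero, shifts counted as different
    ulong mod_shifts;  // subsets with sum zero, modulo cyclic shifts
    ulong lyndon;      // zero-sum subsets whose bit string is a Lyndon word
};

// Rotate the low n bits of x left by one position.
static ulong rotl1(ulong x, ulong n)
{
    const ulong mask = (1UL << n) - 1;
    return ((x << 1) | (x >> (n-1))) & mask;
}

zero_sum_counts count_zero_sums(ulong n, double eps = 1e-9)
{
    assert( n >= 2 && n <= 20 );
    const double pi = std::acos(-1.0);
    std::vector< std::complex<double> > w(n);  // w[k] = omega^k
    for (ulong k = 0; k < n; ++k)
        w[k] = std::polar(1.0, 2.0 * pi * double(k) / double(n));

    zero_sum_counts cnt = { 0, 0, 0 };
    for (ulong s = 0; s < (1UL << n); ++s)  // bit k of s set: k is in S
    {
        std::complex<double> sum(0.0, 0.0);
        for (ulong k = 0; k < n; ++k)
            if ( (s >> k) & 1UL )  sum += w[k];
        if ( std::abs(sum) > eps )  continue;
        ++cnt.all;

        // s is a necklace if no rotation is smaller; the period is the
        // number of rotations until s reappears.
        bool is_min = true;
        ulong period = 0, t = s;
        do
        {
            t = rotl1(t, n);
            ++period;
            if ( t < s )  is_min = false;
        }
        while ( t != s );
        if ( is_min )
        {
            ++cnt.mod_shifts;
            if ( period == n )  ++cnt.lyndon;
        }
    }
    return cnt;
}
```

For n = 12 the count modulo cyclic shifts should equal the number of rows in figure 18.4-A, and summing the periods listed there gives the count with shifts considered as different.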


Chapter 19


Hadamard and conference matrices 


The matrices corresponding to the Walsh transforms (see chapter 23 on page 459) are special cases of
Hadamard matrices. Such matrices also exist for certain sizes N × N for N not a power of 2. We give
construction schemes for Hadamard matrices that come from the theory of finite fields.


If we denote the transform matrix for an N-point Walsh transform by H, then

$$H\,H^T = N\,\mathrm{id} \tag{19.0-1}$$

where id is the unit matrix. The matrix H is orthogonal (up to normalization) and its determinant equals,
up to sign,

$$\det(H) = \det\left(H H^T\right)^{1/2} = N^{N/2} \tag{19.0-2}$$

Further, all entries are either +1 or −1. An orthogonal matrix with these properties is called a Hadamard
matrix. We know that for $N = 2^n$ we always can find such a matrix. For N = 2 we have

$$H_2 = \begin{bmatrix} +1 & +1 \\ +1 & -1 \end{bmatrix} \tag{19.0-3}$$

and we can use the Kronecker product (see section 23.3 on page 462) to construct $H_N$ from $H_{N/2}$ via

$$H_N = \begin{bmatrix} +H_{N/2} & +H_{N/2} \\ +H_{N/2} & -H_{N/2} \end{bmatrix} = H_2 \otimes H_{N/2} \tag{19.0-4}$$

The problem of determining Hadamard matrices (especially for N not a power of 2) comes from
combinatorics. Hadamard matrices of size N × N can only exist if N equals 1, 2, or 4k.
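The doubling step of relation 19.0-4 and the defining property 19.0-1 can be checked with a short sketch. The matrix type `imat` and the function names are chosen for this illustration only; the FXT matrix class is not used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<int> > imat;  // small dense integer matrix

// Apply relation 19.0-4 n times, starting from the 1 x 1 matrix [+1].
imat sylvester_hadamard(unsigned n)
{
    assert( n <= 10 );  // keep the demo small: N = 2^n <= 1024
    imat H(1, std::vector<int>(1, +1));
    for (unsigned s = 0; s < n; ++s)
    {
        const std::size_t h = H.size();
        imat G(2*h, std::vector<int>(2*h, 0));
        for (std::size_t r = 0; r < h; ++r)
            for (std::size_t c = 0; c < h; ++c)
            {
                const int v = H[r][c];
                G[r][c]   = +v;  G[r][c+h]   = +v;
                G[r+h][c] = +v;  G[r+h][c+h] = -v;
            }
        H.swap(G);
    }
    return H;
}

// True if H is square, all entries are +-1, and H * H^T == N * id.
bool is_hadamard(const imat &H)
{
    const std::size_t N = H.size();
    for (std::size_t r = 0; r < N; ++r)
    {
        if ( H[r].size() != N )  return false;
        for (std::size_t c = 0; c < N; ++c)
            if ( H[r][c] != +1 && H[r][c] != -1 )  return false;
    }
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t s = 0; s < N; ++s)
        {
            long dot = 0;
            for (std::size_t c = 0; c < N; ++c)  dot += H[r][c] * H[s][c];
            if ( dot != (r == s ? (long)N : 0L) )  return false;
        }
    return true;
}
```

The check `is_hadamard()` is quadratic in the number of entries, which is acceptable for the small sizes considered here.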


19.1 Hadamard matrices via LFSR 


We start with a construction for certain Hadamard matrices for N a power of 2 that uses m-sequences
that are created by shift registers (see section 41.1 on page 864). Figure 19.1-A shows three Hadamard
matrices that were constructed as follows:

1. Choose $N = 2^n$ and create a maximum length binary shift register sequence S of length N − 1.

2. Make S signed, that is, replace all ones by −1 and all zeros by +1.

3. The N × N matrix H is computed by filling the first row and the first column with ones and filling
   the remaining entries with cyclic copies of S: for r = 1, 2, ..., N − 1 and c = 1, 2, ..., N − 1 set
   $H_{r,c} = S_{c-r+1 \bmod N-1}$.
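The three steps can be sketched without the FXT classes as follows. The m-sequence is produced by the recurrence s[k+n] = s[k+a] XOR s[k], whose characteristic polynomial is the trinomial x^n + x^a + 1; the small table of exponents a for n = 2, ..., 7 is an assumption of this sketch (these trinomials are primitive), and the cyclic copies use the index (c − r) mod (N − 1), which differs from the formula above only by a fixed cyclic shift of S.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<int> > imat;

// Steps 1 and 2: signed maximum length sequence of length 2^n - 1.
std::vector<int> signed_msequence(unsigned n)
{
    // tap[n] = a for the primitive trinomial x^n + x^a + 1 (sketch assumption)
    static const unsigned tap[8] = { 0, 0, 1, 1, 1, 2, 1, 1 };
    assert( n >= 2 && n <= 7 );
    const std::size_t L = (std::size_t(1) << n) - 1;
    std::vector<unsigned> s(L + n, 0);
    s[0] = 1;  // any nonzero initial state works
    for (std::size_t k = 0; k + n < s.size(); ++k)  s[k+n] = s[k+tap[n]] ^ s[k];
    std::vector<int> v(L);
    for (std::size_t k = 0; k < L; ++k)  v[k] = ( s[k] ? -1 : +1 );
    return v;
}

// Step 3: first row and column are ones, the rest are cyclic copies of S.
imat lfsr_hadamard(unsigned n)
{
    const std::vector<int> v = signed_msequence(n);
    const std::size_t L = v.size(), N = L + 1;
    imat H(N, std::vector<int>(N, +1));
    for (std::size_t r = 1; r < N; ++r)
        for (std::size_t c = 1; c < N; ++c)
            H[r][c] = v[(c + L - r) % L];
    return H;
}

// True if all entries are +-1 and H * H^T == N * id.
bool is_hadamard(const imat &H)
{
    const std::size_t N = H.size();
    for (std::size_t r = 0; r < N; ++r)
    {
        if ( H[r].size() != N )  return false;
        for (std::size_t c = 0; c < N; ++c)
            if ( H[r][c] != +1 && H[r][c] != -1 )  return false;
    }
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t s = 0; s < N; ++s)
        {
            long dot = 0;
            for (std::size_t c = 0; c < N; ++c)  dot += H[r][c] * H[s][c];
            if ( dot != (r == s ? (long)N : 0L) )  return false;
        }
    return true;
}
```

For n = 2 the signed sequence is −, +, −, and `lfsr_hadamard(2)` is a 4 × 4 Hadamard matrix.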


The matrices in figure 19.1-A were produced with the program [FXT: comb/hadamard-srs-demo.cc]:


#include "bpol/lfsr.h"  // class lfsr
#include "aux1/copy.h"  // copy_cyclic()
#include "matrix/matrix.h"  // class matrix
typedef matrix<int> Smat;  // matrix with integer entries


Figure 19.1-A: Hadamard matrices created with binary shift register sequences (SRS) of maximum
length. Only the sign of the entries is given, all entries are ±1.


[--snip--]
    ulong n = 5;
    ulong N = 1UL << n;
[--snip--]

    // --- create signed SRS:
    int vec[N-1];
    lfsr S(n);
    for (ulong k=0; k<N-1; ++k)
    {
        ulong x = 1UL & S.get_a();
        vec[k] = ( x ? -1 : +1 );
        S.next();
    }

    // --- create Hadamard matrix:
    Smat H(N,N);
    for (ulong c=0; c<N; ++c)  H.set(0, c, +1);  // first row = [1,1,1,...,1]
    for (ulong r=1; r<N; ++r)
    {
        H.set(r, 0, +1);  // first column = [1,1,1,...,1]^T
        copy_cyclic(vec, H.rowp_[r]+1, N-1, N-r);
    }
[--snip--]


The function copy_cyclic() is defined in [FXT: aux1/copy.h]:


template <typename Type>
inline void copy_cyclic(const Type *src, Type *dst, ulong n, ulong s)
// Copy array src[] to dst[]
// starting from position s in src[]
// wrap around end of src[]  (src[n-1])
//
// src[] is assumed to be of length n
// dst[] must be length n at least
//
// Equivalent to:  { acopy(src, dst, n); rotate_right(dst, n, s)}
{
    ulong k = 0;
    while ( s<n )  dst[k++] = src[s++];

    s = 0;
    while ( k<n )  dst[k++] = src[s++];
}


If we define the matrix X to be the (N − 1) × (N − 1) block of H obtained by deleting the first row and
column, then we have

$$X\,X^T = \begin{bmatrix} N-1 & -1 & -1 & \cdots & -1 \\ -1 & N-1 & -1 & \cdots & -1 \\ \vdots & & \ddots & & \vdots \\ -1 & -1 & -1 & \cdots & N-1 \end{bmatrix} \tag{19.1-1}$$

Equivalently, for the (cyclic) auto-correlation of S (see section 41.6 on page 875):

$$\sum_{k=0}^{L-1} S_k\,S_{k+\tau \bmod L} = \begin{cases} +L & \text{if } \tau = 0 \\ -1 & \text{otherwise} \end{cases} \tag{19.1-2}$$

where L = N − 1 is the length of the sequence.


An alternative way to find Hadamard matrices of dimension $2^n$ is to use the signs in the multiplication
table for hypercomplex numbers described in section 39.14 on page 815.


19.2 Hadamard matrices via conference matrices 


Quadratic characters modulo 13:     Quadratic characters modulo 11:
 0+-++----++-+                       0+-+++---+-
14x14 conference matrix C:          12x12 conference matrix C:
 0+++++++++++++                      0+++++++++++
 +0+-++----++-+                      -0+-+++---+-
 ++0+-++----++-                      --0+-+++---+
 +-+0+-++----++                      -+-0+-+++---
 ++-+0+-++----+                      --+-0+-+++--
 +++-+0+-++----                      ---+-0+-+++-
 +-++-+0+-++---                      ----+-0+-+++
 +--++-+0+-++--                      -+---+-0+-++
 +---++-+0+-++-                      -++---+-0+-+
 +----++-+0+-++                      -+++---+-0+-
 ++----++-+0+-+                      --+++---+-0+
 +++----++-+0+-                      -+-+++---+-0
 +-++----++-+0+
 ++-++----++-+0


Figure 19.2-A: Two conference matrices, the entries not on the diagonal are ±1 and only the sign is
given. The left is a symmetric 14 × 14 matrix (13 ≡ 1 mod 4), the right is an antisymmetric 12 × 12
matrix (11 ≡ 3 mod 4). Replacing all diagonal elements of the right matrix with +1 gives a 12 × 12
Hadamard matrix.


12x12 Hadamard matrix H:            Quadratic characters modulo 5:
 ++++++-+++++                        0+--+
 +++--++-+--+                       6x6 conference matrix C:
 ++++--++-+--                        0+++++
 +-+++-+-+-+-                        +0+--+
 +--++++--+-+                        ++0+--
 ++--++++--+-                        +-+0+-
 -+++++------                        +--+0+
 +-+--+---++-                        ++--+0
 ++-+------++
 +-+-+--+---+
 +--+-+-++---
 ++--+---++--


Figure 19.2-B: A Hadamard matrix (left) created from a symmetric conference matrix (right).


A conference matrix $C_Q$ is a Q × Q matrix with zero diagonal and all other entries ±1 so that

$$C_Q\,C_Q^T = (Q-1)\,\mathrm{id} \tag{19.2-1}$$

We give an algorithm for computing a conference matrix $C_Q$ for Q = q + 1 where q is an odd prime:

1. Create a length-q array S with entries $S_k \in \{-1, 0, +1\}$ as follows: set $S_0 = 0$ and, for 1 ≤ k < q,
   set $S_k = +1$ if k is a square modulo q, $S_k = -1$ else.

2. Set y = 1 if q ≡ 1 mod 4, else y = −1 (then q ≡ 3 mod 4).

3. Set $C_Q[0,0] = 0$ and $C_Q[0,k] = +1$ for 1 ≤ k < Q (first row). Set $C_Q[k,0] = y$ for 1 ≤ k < Q
   (first column). Fill the remaining entries with cyclic copies of S: for 1 ≤ r ≤ q and 1 ≤ c ≤ q set
   $C_Q[r,c] = S_{c-r \bmod q}$, so that the diagonal entries are $S_0 = 0$.

The quantity y tells us whether $C_Q$ is symmetric (y = +1) or antisymmetric (y = −1). If $C_Q$ is
antisymmetric, then

$$H_Q = C_Q + \mathrm{id} \tag{19.2-2}$$

is a Q × Q Hadamard matrix. For example, replacing all zeros in the 12 × 12 matrix in figure 19.2-A by
+1 gives a 12 × 12 Hadamard matrix. If $C_Q$ is symmetric, then a 2Q × 2Q Hadamard matrix is given by

$$H_{2Q} = \begin{bmatrix} +\mathrm{id} + C_Q & -\mathrm{id} + C_Q \\ -\mathrm{id} + C_Q & -\mathrm{id} - C_Q \end{bmatrix} \tag{19.2-3}$$

Figure 19.2-B shows a 12 × 12 Hadamard matrix that was created using this formula. The construction
of Hadamard matrices via conference matrices is due to Raymond Paley.
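The algorithm and relations 19.2-2 and 19.2-3 can be sketched in a few self-contained routines. The names `conference_matrix` and `paley_hadamard` are chosen for this illustration, q is assumed to be a small odd prime, and the cyclic copies use the index (c − r) mod q so that the diagonal is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<int> > imat;

static bool is_odd_prime(unsigned q)
{
    if ( q < 3 || q % 2 == 0 )  return false;
    for (unsigned d = 3; d*d <= q; d += 2)  if ( q % d == 0 )  return false;
    return true;
}

// Steps 1 to 3: (q+1) x (q+1) conference matrix for an odd prime q.
imat conference_matrix(unsigned q)
{
    assert( is_odd_prime(q) && q < 1000 );
    std::vector<int> S(q, -1);  // quadratic characters modulo q
    S[0] = 0;
    for (unsigned k = 1; k <= (q-1)/2; ++k)  S[(k*k) % q] = +1;
    const int y = ( q % 4 == 1 ? +1 : -1 );
    const std::size_t Q = q + 1;
    imat C(Q, std::vector<int>(Q, 0));
    for (std::size_t c = 1; c < Q; ++c)  C[0][c] = +1;  // first row
    for (std::size_t r = 1; r < Q; ++r)
    {
        C[r][0] = y;  // first column
        for (std::size_t c = 1; c < Q; ++c)  C[r][c] = S[(c + q - r) % q];
    }
    return C;
}

// Relation 19.2-2 (q = 3 mod 4) or 19.2-3 (q = 1 mod 4).
imat paley_hadamard(unsigned q)
{
    const imat C = conference_matrix(q);
    const std::size_t Q = C.size();
    if ( q % 4 == 3 )
    {
        imat H = C;
        for (std::size_t k = 0; k < Q; ++k)  H[k][k] += 1;
        return H;
    }
    imat H(2*Q, std::vector<int>(2*Q, 0));
    for (std::size_t r = 0; r < Q; ++r)
        for (std::size_t c = 0; c < Q; ++c)
        {
            const int d = ( r == c ? 1 : 0 );  // entry of id
            H[r][c]   = +d + C[r][c];  H[r][c+Q]   = -d + C[r][c];
            H[r+Q][c] = -d + C[r][c];  H[r+Q][c+Q] = -d - C[r][c];
        }
    return H;
}

// True if all entries are +-1 and H * H^T == N * id.
bool is_hadamard(const imat &H)
{
    const std::size_t N = H.size();
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t s = 0; s < N; ++s)
        {
            if ( H[r].size() != N || H[s].size() != N )  return false;
            long dot = 0;
            for (std::size_t c = 0; c < N; ++c)
            {
                if ( H[r][c] != +1 && H[r][c] != -1 )  return false;
                dot += H[r][c] * H[s][c];
            }
            if ( dot != (r == s ? (long)N : 0L) )  return false;
        }
    return true;
}
```

With q = 11 the result is a 12 × 12 matrix obtained from the antisymmetric case, with q = 5 a 12 × 12 matrix obtained from the symmetric case, matching the two constructions shown in figures 19.2-A and 19.2-B.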


The program [FXT: comb/conference-quadres-demo.cc] outputs for a given q the Q × Q conference matrix
and the corresponding Hadamard matrix:


#include "mod/numtheory.h"  // kronecker()
#include "matrix/matrix.h"  // class matrix
#include "aux1/copy.h"  // copy_cyclic()
[--snip--]
    int y = ( 1==q%4 ? +1 : -1 );
    ulong Q = q+1;
[--snip--]
    // --- create table of quadratic characters modulo q:
    int vec[q];  fill<int>(vec, q, -1);  vec[0] = 0;
    for (ulong k=1; k<(q+1)/2; ++k)  vec[(k*k)%q] = +1;
[--snip--]
    // --- create Q x Q conference matrix:
    Smat C(Q,Q);
    C.set(0,0, 0);
    for (ulong c=1; c<Q; ++c)  C.set(0, c, +1);  // first row = [1,1,1,...,1]
    for (ulong r=1; r<Q; ++r)
    {
        C.set(r, 0, y);  // first column = +-[1,1,1,...,1]^T
        copy_cyclic(vec, C.rowp_[r]+1, q, Q-r);
    }
[--snip--]
    // --- create a N x N Hadamard matrix:
    ulong N = ( y<0 ? Q : 2*Q );
    Smat H(N,N);
    if ( N==Q )
    {
        copy(C, H);
        H.diag_add_val(1);
    }
    else
    {
        Smat K2(2,2);  K2.fill(+1);  K2.set(1,1, -1);  // K2 = [+1,+1; +1,-1]
        H.kronecker(K2, C);  // Kronecker product of matrices
        for (ulong k=0; k<Q; ++k)  // adjust diagonal of sub-matrices
        {
            ulong r, c;
            r=k;    c=k;    H.set(r,c, H.get(r,c)+1);
            r=k;    c=k+Q;  H.set(r,c, H.get(r,c)-1);
            r=k+Q;  c=k;    H.set(r,c, H.get(r,c)-1);
            r=k+Q;  c=k+Q;  H.set(r,c, H.get(r,c)-1);
        }
    }
[--snip--]


If both $H_a$ and $H_b$ are Hadamard matrices (of dimensions a and b, respectively), then their Kronecker
product $H_{ab} = H_a \otimes H_b$ is again a Hadamard matrix:

$$H_{ab}\,H_{ab}^T = (H_a \otimes H_b)\,(H_a \otimes H_b)^T \overset{*}{=} (H_a \otimes H_b)\,(H_a^T \otimes H_b^T) = \tag{19.2-4a}$$

$$= (H_a H_a^T) \otimes (H_b H_b^T) \overset{*}{=} (a\,\mathrm{id}) \otimes (b\,\mathrm{id}) = a\,b\,\mathrm{id} \tag{19.2-4b}$$

The starred equalities use relations 23.3-11a and 23.3-10a, respectively.


19.3 Conference matrices via finite fields 


The algorithm for odd primes q can be modified to work also for powers of odd primes. We have to work
with the finite fields GF($q^n$). The entries $C_{r+1,c+1}$ for r = 0, 1, ..., $q^n - 1$ and c = 0, 1, ..., $q^n - 1$
have to be the quadratic character of $z_r - z_c$ where $z_0, z_1, \ldots, z_{q^n-1}$ are the elements in GF($q^n$) in some
(fixed) order.


We give two simple GP routines that map the elements $z_i \in$ GF($q^n$) (represented as polynomials modulo
q) to the numbers 0, 1, ..., $q^n - 1$. The polynomial $p(x) = c_0 + c_1 x + \ldots + c_{n-1} x^{n-1}$ is mapped to
$N = c_0 + c_1 q + \ldots + c_{n-1} q^{n-1}$.


pol2num(p,q)=
\\ Return number for polynomial p.
{
    p = lift(p);  \\ remove mods, e.g. p=Mod(2, 3)*x^2 + Mod(1, 3) --> 2*x^2+1
    return ( subst(p, 'x, q) );
}


The inverse routine is


num2pol(n,q)=
\\ Return polynomial for number n.
{
    local(p, mq, k);
    p = Pol(0,'x);
    k = 0;
    while ( 0!=n,
        mq = n % q;
        p += mq * ('x)^k;
        n -= mq;
        n \= q;
        k++;
    );
    return( p );
}


The quadratic character of an element z can be determined by computing $z^{(q^n-1)/2}$ modulo the field
polynomial. The result will be zero for z = 0, else ±1.
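For the special case n = 1 (the prime field GF(q)) the power is an ordinary modular exponentiation. The sketch below covers only this case; the general case needs polynomial arithmetic modulo the field polynomial, which is not shown. The function names are chosen here for illustration.

```cpp
#include <cassert>
#include <string>

// Quadratic character of z modulo an odd prime q via z^((q-1)/2) mod q.
// Returns 0 for z = 0 (mod q), else +1 for squares and -1 for non-squares.
int quadratic_character(unsigned long z, unsigned long q)
{
    assert( q >= 3 && q % 2 == 1 && q < 65536UL );  // q assumed to be an odd prime
    z %= q;
    if ( z == 0 )  return 0;
    unsigned long r = 1, b = z, e = (q - 1) / 2;
    while ( e )  // binary exponentiation
    {
        if ( e & 1 )  r = (r * b) % q;
        b = (b * b) % q;
        e >>= 1;
    }
    return ( r == 1 ? +1 : -1 );  // r is either 1 or q-1
}

// Table of characters for z = 0, 1, ..., q-1 as a string of '0', '+', '-'.
std::string character_table(unsigned long q)
{
    std::string t;
    for (unsigned long z = 0; z < q; ++z)
    {
        const int c = quadratic_character(z, q);
        t += ( c == 0 ? '0' : ( c > 0 ? '+' : '-' ) );
    }
    return t;
}
```

The strings for q = 13, 11, and 5 can be compared with the tables of quadratic characters in figures 19.2-A and 19.2-B.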


For our purpose it is better to precompute a table of the quadratic characters for later lookup:


quadcharvec(fp, q)=
\\ Return a table of quadratic characters in GF(q^n)
\\ fp is the field polynomial.
{
    local(n, qn, sv, pl);
    n=poldegree(fp);
    qn=q^n-1;
    sv=vector(qn+1, j, -1);
    sv[1] = 0;
    for (k=1, qn,
        pl = num2pol(k,q);
        pl = Mod(Mod(1,q)*pl, fp);
        sq = pl * pl;
        sq = lift(sq);  \\ remove mod
        i = pol2num( sq, q );
        sv[i+1] = +1;
    );
    return( sv );
}

With this table we can compute the quadratic characters of the difference of two elements efficiently:


getquadchar_v(n1, n2, q, fp, sv)=
\\ Return the quadratic character of (n2-n1) in GF(q^n)
\\ Table lookup method
{
    local(p1, p2, d, nd, sc);
    if ( n1==n2, return(0) );
    p1 = num2pol(n1, q);
    p2 = num2pol(n2, q);
    d = (p2-p1) % fp;
    nd = pol2num(d, q);
    sc = sv[nd+1];
    return( sc );
}


Now we can construct conference matrices: 


matconference(q, fp, sv)=
\\ Return a QxQ conference matrix.
\\ q an odd prime.
\\ fp an irreducible polynomial modulo q.
\\ sv table of quadratic characters in GF(q^n)
\\   where n is the degree of fp.
{
    local(y, Q, C, n);
    n = poldegree(fp);
    Q=q^n+1;
    if ( sv[2]==sv[Q-1], y=+1, y=-1 );  \\ symmetry
    C = matrix(Q,Q);
    for (k=2, Q, C[1,k]=+1);  \\ first row
    for (k=2, Q, C[k,1]=y);   \\ first column
    for (r=2, Q,
        for (c=2, Q,
            sc = getquadchar_v(r-2, c-2, q, fp, sv);
            C[r,c] = sc;
        );
    );
    return( C );
}


q=3  fp=x^2+1  GF(3^2)
Table of quadratic characters:
 0+++--+--
10x10 conference matrix C:
 0+++++++++
 -0+++--+--
 -+0+-+--+-
 -++0--+--+
 -+--0+++--
 --+-+0+-+-
 ---+++0--+
 -+--+--0++
 --+--+-+0+
 ---+--+++0


Figure 19.3-A: A 10 × 10 conference matrix for q = 3 and the field polynomial f = x^2 + 1.


To compute a Q × Q conference matrix where $Q = q^n + 1$ we need to find a polynomial of degree n that
is irreducible modulo q. With q = 3 and the field polynomial f = x^2 + 1 (so n = 2) we get the 10 × 10
conference matrix shown in figure 19.3-A. A conference matrix for q = 3 and f = x^3 − x + 1 is given in
figure 19.3-B. Hadamard matrices can be created in the same manner as before, the symmetry criterion
being whether $q^n \equiv \pm 1 \bmod 4$.


The conference matrices obtained are of size $Q = q^n + 1$ where q is an odd prime. The values Q ≤ 100
are (see sequence A061344 in [312]):


4, 6, 8, 10, 12, 14, 18, 20, 24, 26, 28, 30, 32, 38, 42, 44, 48,
50, 54, 60, 62, 68, 72, 74, 80, 82, 84, 90, 98


Figure 19.3-B: A 28 × 28 conference matrix for q = 3 and the field polynomial f = x^3 − x + 1.


Our construction does not give conference matrices for any odd Q, and these even values Q ≤ 100:

2, 16, 22, 34, 36, 40, 46, 52, 56, 58, 64, 66, 70, 76, 78, 86, 88, 92, 94, 96, 100

For example, Q = 16 = 15 + 1 = 3·5 + 1 does not have the required form.


If a conference matrix of size Q exists, then we can create Hadamard matrices of size N = Q whenever
q^n = 3 mod 4 and of size N = 2 Q whenever q^n = 1 mod 4. Further, if Hadamard matrices of sizes N and M
exist, then the Kronecker product of those matrices is an (N·M) x (N·M) Hadamard matrix.
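
The Kronecker product step can be checked directly. The sketch below is an illustration only (the type matrix and the functions kronecker() and is_hadamard() are names chosen here, not FXT routines); it builds the product of two ±1 matrices and tests the defining relation H H^T = N I.

```cpp
#include <cassert>
#include <vector>

typedef std::vector<std::vector<int> > matrix;  // entries are +1 or -1

// Kronecker product: block (i,j) of the result is A[i][j] * B.
matrix kronecker(const matrix &A, const matrix &B)
{
    const std::size_t n = A.size(), m = B.size();
    assert( n > 0 && m > 0 && A[0].size() == n && B[0].size() == m );
    matrix K(n * m, std::vector<int>(n * m));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t r = 0; r < m; ++r)
                for (std::size_t c = 0; c < m; ++c)
                    K[i*m + r][j*m + c] = A[i][j] * B[r][c];
    return K;
}

// Test H * H^T == n * I, that is, distinct rows are orthogonal
// and every row has squared norm n.
bool is_hadamard(const matrix &H)
{
    const std::size_t n = H.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            long s = 0;
            for (std::size_t k = 0; k < n; ++k)  s += H[i][k] * H[j][k];
            if ( s != (i == j ? (long)n : 0L) )  return false;
        }
    return true;
}
```

Starting from the 2 x 2 matrix with rows (+1, +1) and (+1, -1), repeated products give Hadamard matrices of sizes 4, 8, 16, and so on.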


The values of N = 4k < 2000 such that this construction does not give an N x N Hadamard matrix are: 


92, 116, 156, 172, 184, 188, 232, 236, 260, 268, 292, 324, 356, 372, 
376, 404, 412, 428, 436, 452, 472, 476, 508, 520, 532, 536, 584, 
596, 604, 612, 652, 668, 712, 716, 732, 756, 764, 772, 808, 836, 
852, 856, 872, 876, 892, 904, 932, 940, 944, 952, 956, 964, 980, 
988, 996, 1004, 1012, 1016, 1028, 1036, 1068, 1072, 1076, 1100, 
1108, 1132, 1148, 1168, 1180, 1192, 1196, 1208, 1212, 1220, 1244, 
1268, 1276, 1300, 1316, 1336, 1340, 1364, 1372, 1380, 1388, 1396, 
1412, 1432, 1436, 1444, 1464, 1476, 1492, 1508, 1528, 1556, 1564, 
1588, 1604, 1612, 1616, 1636, 1652, 1672, 1676, 1692, 1704, 1712, 
1732, 1740, 1744, 1752, 1772, 1780, 1796, 1804, 1808, 1820, 1828, 
1836, 1844, 1852, 1864, 1888, 1892, 1900, 1912, 1916, 1928, 1940, 
1948, 1960, 1964, 1972, 1976, 1992 


This is sequence A046116 in [312]. It can be computed by starting with a list of all numbers of the form
4k and deleting all values k = 2^e (q + 1) where q is a power of an odd prime.


Constructions for Hadamard matrices for numbers of certain forms are known, see [234] and [157]. 
Whether Hadamard matrices exist for all values N = 4k is an open problem. A readable source about 
constructions for Hadamard matrices is [316]. Hadamard matrices for all N < 256 are given in [313]. 
Chapter 20


Searching paths in directed graphs ‡


We describe how certain combinatorial structures can be represented as paths or cycles in a directed 
graph. As an example consider Gray codes of n-bit binary words: we are looking for sequences of all 2^n
binary words such that only one bit changes between two successive words. A convenient representation
of the search space is that of a graph. The nodes are the binary words and an edge is drawn between
two nodes if the nodes' values differ by exactly one bit. Every path that visits all nodes of that graph
corresponds to a Gray code. If the path is a cycle, a Gray cycle was found.


Depending on the size of the problem, we can 
1. try to find at least one object, 
2. generate all objects, 
3. show that no such object exists. 


The method used is usually called backtracking. We will see how to reduce the search space if additional 
constraints are imposed on the paths. Finally, we show how careful optimization can lead to surprising 
algorithms for objects of a size where one would hardly expect to obtain a result at all. In fact, Gray 
cycles through the n-bit binary Lyndon words for all odd n ≤ 37 are determined.


We use graphs solely as a tool for finding combinatorial structures. For algorithms dealing with the 
properties of graphs see, for example, [220] and [307]. 


Terminology and conventions 


We will use the terms node (instead of vertex) and edge (sometimes called arc). We restrict our attention 
to directed graphs (or digraphs) as undirected graphs are just a special case of these: an edge in an
undirected graph corresponds to two antiparallel edges (think: ‘arrows’) in a directed graph. 


A length-k path is a sequence of nodes where an edge leads from each node to its successor. A path is 
called simple if the nodes are pair-wise distinct. We restrict our attention to simple paths of length N 
where N is the number of nodes of the graph. We use the term full path for a simple path of length N. 


If in a simple path there is an edge from the last node of the path to the starting node the path is a cycle 
(or circuit). A full path that is a cycle is called a Hamiltonian cycle, a graph containing such a cycle is 
called Hamiltonian. 


We allow for loops (edges that start and point to the same node). Graphs that contain loops are called 
pseudo graphs. The algorithms used will effectively ignore loops. We disallow multigraphs (where multiple 
edges can start and end at the same two nodes), as these would lead to repeated output of identical objects. 


The neighbors of a node are those nodes to which outgoing edges point. Neighbors can be reached with 
one step. The neighbors of a node are called adjacent to the node. The adjacency matrix of a graph with
N nodes is an N x N matrix A where A_{i,j} = 1 if there is an edge from node i to node j, else A_{i,j} = 0.
While easy to implement (and modify later) we will not use this kind of representation as the memory 
requirement would be prohibitive for large graphs. 
20.1 Representation of digraphs


For our purposes a static implementation of the graph as arrays of nodes and (outgoing) edges will suffice. 
The container class digraph merely allocates memory for the nodes and edges. The correct initialization 
is left to the user [FXT: class digraph in graph/digraph.h]:

    class digraph
    {
    public:
        ulong ng_;   // number of Nodes of Graph
        ulong *ep_;  // e[ep[k]], ..., e[ep[k+1]-1]: outgoing connections of node k
        ulong *e_;   // outgoing connections (Edges)
        ulong *vn_;  // optional: sorted values for nodes
        // if vn is used, then node k must correspond to vn[k]

    public:
        digraph(ulong ng, ulong ne, ulong *&ep, ulong *&e, bool vnq=false)
            : ng_(0), ep_(0), e_(0), vn_(0)
        {
            ng_ = ng;
            ep_ = new ulong[ng_+1];
            e_ = new ulong[ne];
            ep = ep_;
            e = e_;
            if ( vnq )  vn_ = new ulong[ng_];
        }

        ~digraph()
        {
            delete [] ep_;
            delete [] e_;
            if ( vn_ )  delete [] vn_;
        }

      [--snip--]
        void get_edge_idx(ulong p, ulong &fe, ulong &en)  const
        // Setup fe and en so that the nodes reachable from p are
        //   e[fe], e[fe+1], ..., e[en-1].
        // Must have: 0<=p<ng
        {
            fe = ep_[p];    // (index of) First Edge
            en = ep_[p+1];  // (index of) first Edge of Next node
        }

      [--snip--]
        void print(const char *bla=0)  const;
    };

The nodes reachable from node p could be listed using 


    // ulong p;  // == position
    cout << "The nodes reachable from node " << p << " are:" << endl;
    ulong fe, en;
    g_.get_edge_idx(p, fe, en);
    for (ulong ep=fe; ep<en; ++ep)  cout << g_.e_[ep] << endl;


With our representation there is no cheap method to find the incoming edges. We will not need this 
information for our purposes. If the graph is known to be undirected, the same routine obviously lists 
the incoming edges. 
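
The same layout can be sketched with standard containers. This is an illustration under stated assumptions, not the FXT class: the names csr_digraph, make_csr() and incoming() are invented here, and std::vector replaces the raw arrays. The routine incoming() shows why incoming edges are expensive with this representation: all edges have to be scanned.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Compressed edge list: targets of node k are e[ep[k]], ..., e[ep[k+1]-1].
struct csr_digraph
{
    std::vector<ulong> ep;  // ng+1 offsets into e
    std::vector<ulong> e;   // concatenated targets of all edges

    ulong num_nodes() const  { return ep.size() - 1; }

    void get_edge_idx(ulong p, ulong &fe, ulong &en) const
    {
        assert( p < num_nodes() );
        fe = ep[p];    // index of first edge of node p
        en = ep[p+1];  // index of first edge of the next node
    }
};

// Build the compressed form from per-node neighbor lists.
csr_digraph make_csr(const std::vector<std::vector<ulong> > &adj)
{
    csr_digraph g;
    g.ep.push_back(0);
    for (std::size_t k = 0; k < adj.size(); ++k)
    {
        g.e.insert(g.e.end(), adj[k].begin(), adj[k].end());
        g.ep.push_back(g.e.size());
    }
    return g;
}

// Nodes with an edge pointing to p: needs a scan over all edges.
std::vector<ulong> incoming(const csr_digraph &g, ulong p)
{
    std::vector<ulong> in;
    for (ulong k = 0; k < g.num_nodes(); ++k)
        for (ulong i = g.ep[k]; i < g.ep[k+1]; ++i)
            if ( g.e[i] == p )  in.push_back(k);
    return in;
}
```

The memory used is one word per node plus one word per edge, which is the reason for preferring this layout over an adjacency matrix for large sparse graphs.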


Initialization routines for certain digraphs are declared in [FXT: graph/mk-special-digraphs.h]. A simple
example is [FXT: graph/mk-complete-digraph.cc]:

    digraph
    make_complete_digraph(ulong n)
    // Initialization for the complete graph.
    {
        ulong ng = n,  ne = n*(n-1);
        ulong *ep, *e;
        digraph dg(ng, ne, ep, e);

        ulong j = 0;
        for (ulong k=0; k<ng; ++k)  // for all nodes
        {
            ep[k] = j;
            for (ulong i=0; i<n; ++i)  // connect to all nodes
            {
                if ( k==i )  continue;  // skip loops
                e[j++] = i;
            }
        }
        ep[ng] = j;
        return dg;
    }


We initialize the complete graph (the undirected graph that has edges between any two of its nodes) for 
n = 5 and print it [FXT: graph/graph-perm-demo.cc]:

    digraph dg = make_complete_digraph(5);
    dg.print("Graph =");

The output is

    Graph =
    Node:  Edge0  Edge1  ...
      0:    1  2  3  4
      1:    0  2  3  4
      2:    0  1  3  4
      3:    0  1  2  4
      4:    0  1  2  3
    #nodes=5  #edges=20


For many purposes it suffices to implicitly represent the nodes as values p with 0 ≤ p < N where N is
the number of nodes. If not, the values of the nodes have to be stored in the array vn_[]. One such
example is a graph where the value of node p is the p-th (cyclically minimal) Lyndon word that we will
meet at the end of this chapter. To make the search for a node by value reasonably fast, the array vn_[]
should be sorted so that binary search can be used.
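
A lookup of this kind might look as follows. The function node_index() is a hypothetical helper written for this illustration (it is not an FXT routine); it relies on std::lower_bound and returns the array size when the value is absent.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Return the index k with vn[k] == val, or vn.size() if val is absent.
// The array vn must be sorted in increasing order.
ulong node_index(const std::vector<ulong> &vn, ulong val)
{
    assert( std::is_sorted(vn.begin(), vn.end()) );
    std::vector<ulong>::const_iterator it =
        std::lower_bound(vn.begin(), vn.end(), val);
    if ( it == vn.end() || *it != val )  return vn.size();
    return it - vn.begin();
}
```

With the six 5-bit Lyndon words 1, 3, 5, 7, 11, 15 as node values, the word 7 is found at index 3 using at most three comparisons of the binary search.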


20.2 Searching full paths 


To search full paths starting from some position p_0 we need two additional arrays for the bookkeeping:
A record rv_[] of the path so far, its k-th entry is p_k, the node visited at step k. A tag array qq_[] that
contains a one for nodes already visited, otherwise a zero. The crucial parts of the implementation are
[FXT: class digraph_paths in graph/digraph-paths.h]:

    class digraph_paths
    // Find all full paths in a directed graph.
    {
    public:
        digraph &g_;  // the graph
        ulong *rv_;   // Record of Visits: rv[k] == node visited at step k
        ulong *qq_;   // qq[k] == whether node k has been visited yet
      [--snip--]
        // function to call with each path found with all_paths():
        ulong (*pfunc_)(digraph_paths &);
      [--snip--]
        // function to impose condition with all_cond_paths():
        bool (*cfunc_)(digraph_paths &, ulong ns);

    public:
        // graph/digraph.cc:
        digraph_paths(digraph &g);
        ~digraph_paths();
      [--snip--]
        bool path_is_cycle()  const;
      [--snip--]
        void print_path()  const;
      [--snip--]

        // graph/digraphpaths-search.cc:
        ulong all_paths(ulong (*pfunc)(digraph_paths &),
                        ulong ns=0, ulong p=0, ulong maxnp=0);

    private:
        void next_path(ulong ns, ulong p);  // called by all_paths()
      [--snip--]
    };


We could have used a bit-array for the tag values qq_[]. It turns out that some additional information
can be saved there as we will see in a moment.

To keep matters simple a recursive algorithm is used to search for (full) paths. The search is started via
a call to all_paths() [FXT: graph/digraph-paths.cc]:


    ulong
    digraph_paths::all_paths(ulong (*pfunc)(digraph_paths &),
                             ulong ns/*=0*/, ulong p/*=0*/, ulong maxnp/*=0*/)
    // pfunc: function to visit (process) paths
    // ns: start at node index ns (for fixing start of path)
    // p: start at node value p (for fixing start of path)
    // maxnp: stop if maxnp paths were found
    {
        pct_ = 0;
        cct_ = 0;
        pfct_ = 0;
        pfunc_ = pfunc;
        pfdone_ = 0;
        maxnp_ = maxnp;
        next_path(ns, p);
        return pfct_;  // Number of paths where pfunc() returned true
    }

The search is done by the function next_path():

    void
    digraph_paths::next_path(ulong ns, ulong p)
    // ns+1 == how many nodes seen
    // p == position (node we are on)
    {
        if ( pfdone_ )  return;

        rv_[ns] = p;  // record position
        ++ns;

        if ( ns==ng_ )  // all nodes seen ?
        {
            pfunc_(*this);
        }
        else
        {
            qq_[p] = 1;  // mark position as seen (else loops lead to errors)
            ulong fe, en;
            g_.get_edge_idx(p, fe, en);
            ulong fct = 0;  // count free reachable nodes  // FCT
            for (ulong ep=fe; ep<en; ++ep)
            {
                ulong t = g_.e_[ep];  // next node
                if ( 0==qq_[t] )  // node free?
                {
                    ++fct;
                    qq_[p] = fct;  // mark position as seen: record turns  // FCT
                    next_path(ns, t);
                }
            }
            // if ( 0==fct )  { "dead end: this is a U-turn"; }  // FCT

            qq_[p] = 0;  // unmark position
        }
    }

The lines that are commented with // FCT record which among the free nodes is visited. The algorithm
still works if these lines are commented out.
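
The backtracking scheme can be tried in isolation. The following is a stripped-down sketch and not the FXT class: the struct path_counter and the helper complete_graph() are names invented here, neighbor lists are stored as vectors, and paths are only counted instead of being passed to a visitor function.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

struct path_counter
{
    std::vector<std::vector<ulong> > adj;  // adj[k]: targets of node k
    std::vector<ulong> rv;                 // record of visits
    std::vector<char> seen;                // tag array
    ulong count;                           // number of full paths found

    explicit path_counter(const std::vector<std::vector<ulong> > &a)
        : adj(a), rv(a.size()), seen(a.size(), 0), count(0)  { }

    void next_path(ulong ns, ulong p)
    {
        rv[ns] = p;  // record position
        ++ns;
        if ( ns == adj.size() )  { ++count;  return; }  // full path found
        seen[p] = 1;  // mark (this also makes the search skip loops)
        for (std::size_t i = 0; i < adj[p].size(); ++i)
        {
            ulong t = adj[p][i];
            if ( !seen[t] )  next_path(ns, t);
        }
        seen[p] = 0;  // unmark: backtrack
    }

    ulong all_paths(ulong start)
    {
        assert( start < adj.size() );
        count = 0;
        next_path(0, start);
        return count;
    }
};

// Complete graph on n nodes, without loops.
std::vector<std::vector<ulong> > complete_graph(ulong n)
{
    std::vector<std::vector<ulong> > adj(n);
    for (ulong k = 0; k < n; ++k)
        for (ulong i = 0; i < n; ++i)
            if ( i != k )  adj[k].push_back(i);
    return adj;
}
```

For the complete graph with 5 nodes the count of full paths from node 0 is 4! = 24, one for each permutation of the remaining nodes.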




    Graph =                         0:  1 2 3 4
    Node:  Edge0  Edge1  ...        1:  1 2 4 3
      0:    1  2  3  4              2:  1 3 2 4
      1:    0  2  3  4              3:  1 3 4 2
      2:    0  1  3  4              4:  1 4 2 3
      3:    0  1  2  4              5:  1 4 3 2
      4:    0  1  2  3              6:  2 1 3 4
    #nodes=5  #edges=20             7:  2 1 4 3
                                    8:  2 3 1 4
                                  [--snip--]
                                   21:  4 2 3 1
                                   22:  4 3 1 2
                                   23:  4 3 2 1

Figure 20.2-A: Edges of the complete graph with 5 nodes (left) and full paths starting at node 0 (right). 
The paths (where 0 is omitted) correspond to the permutations of 4 elements in lexicographic order. 


20.2.1 Paths in the complete graph: permutations 


The program [FXT: graph/graph-perm-demo.cc] shows the paths in the complete graph from section 20.1
on page 392. We give a slightly simplified version:


    ulong pfunc_perm(digraph_paths &dp)
    // Function to be called with each path:
    //   print all but the first node.
    {
        const ulong *rv = dp.rv_;
        ulong ng = dp.ng_;

        cout << setw(4) << dp.pfct_ << ":   ";
        for (ulong k=1; k<ng; ++k)  cout << " " << rv[k];
        cout << endl;

        return 1;
    }

    int
    main(int argc, char **argv)
    {
        ulong n = 5;
        ulong maxnp = 0;  // stop after maxnp paths (0: never stop)
        digraph dg = make_complete_digraph(n);
        digraph_paths dp(dg);

        dg.print("Graph =");
        cout << endl;

        dp.all_paths(pfunc_perm, 0, 0, maxnp);
        return 0;
    }


The output, shown in figure 20.2-A, is a listing of the permutations of the numbers 1, 2, 3, 4 in lexicographic
order (see section 10.2 on page 242).


20.2.2 Paths in the De Bruijn graph: De Bruijn sequences 


The graph with 2n nodes and two outgoing edges from node k to 2k mod 2n and 2k + 1 mod 2n is
called a (binary) De Bruijn graph. For n = 8 the graph is (printed horizontally):

    Node:     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    Edge 0:   0  2  4  6  8 10 12 14  0  2  4  6  8 10 12 14
    Edge 1:   1  3  5  7  9 11 13 15  1  3  5  7  9 11 13 15

The graph has a loop at both the first and the last node. All full paths in the De Bruijn graph are cycles,
the graph is Hamiltonian.


With n a power of 2 the paths correspond to the De Bruijn sequences (DBS) of length 2n. The graph
has as many full paths as there are DBSs and the zeros/ones in the DBS correspond to even/odd values
of the nodes, respectively. This is demonstrated in [FXT: graph/graph-debruijn-demo.cc] (shortened):


    ulong pq = 1;  // whether and what to print with each cycle

    ulong pfunc_db(digraph_paths &dp)
    // Function to be called with each cycle.
    {
        switch ( pq )
        {
        case 0:  break;  // just count
        case 1:  // print lowest bits (De Bruijn sequence)
            {
                ulong *rv = dp.rv_,  ng = dp.ng_;
                for (ulong k=0; k<ng; ++k)  cout << (rv[k]&1UL ? '1' : '.');
                cout << endl;
                break;
            }
      [--snip--]
        }
        return 1;
    }

    int main(int argc, char **argv)
    {
        ulong n = 8;
        NXARG(pq, "what to do in pfunc()");
        ulong maxnp = 0;
        NXARG(maxnp, "stop after maxnp paths (0: never stop)");
        ulong p0 = 0;
        NXARG(p0, "start position <2*n");
        digraph dg = make_debruijn_digraph(n);
        digraph_paths dp(dg);
        dg.print_horiz("Graph =");
        // call pfunc() with each cycle:
        dp.all_paths(pfunc_db, 0, p0, maxnp);
        cout << "n = " << n;
        cout << "  (ng=" << dg.ng_ << ")";
        cout << "  #cycles = " << dp.cct_;
        cout << endl;
        return 0;
    }


Figure 20.2-B: Edges of the De Bruijn graph (top) and all paths starting at node 0 together with the
corresponding De Bruijn sequences (bottom). Dots denote zeros.


The macro NXARG() reads one argument, it is defined in [FXT: nextarg.h]. Figure 20.2-B was created
with the shown program.


The algorithm is a very effective way for generating all DBSs of a given length, the 67,108,864 DBSs of
length 64 are generated in 140 seconds when printing is disabled (set argument pq to zero), corresponding
to a rate of more than 450,000 DBSs per second.
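
A small self-contained version of this search is sketched next. It is an illustration and not the FXT demo: the names debruijn_search and is_debruijn() are invented here, and the two outgoing edges of each node are computed on the fly instead of being stored.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

typedef unsigned long ulong;

// Collect, for every full path from node 0 in the De Bruijn graph with
// ng nodes, the string of lowest bits of the visited nodes.
struct debruijn_search
{
    ulong ng;
    std::vector<ulong> rv;          // record of visits
    std::vector<char> seen;         // tag array
    std::vector<std::string> seqs;  // one string per full path

    explicit debruijn_search(ulong n) : ng(n), rv(n), seen(n, 0)  { }

    void next_path(ulong ns, ulong p)
    {
        rv[ns] = p;  ++ns;
        if ( ns == ng )
        {
            std::string s;
            for (ulong k = 0; k < ng; ++k)  s += (rv[k] & 1UL ? '1' : '.');
            seqs.push_back(s);
            return;
        }
        seen[p] = 1;
        for (ulong b = 0; b < 2; ++b)
        {
            ulong t = (2*p + b) % ng;  // the two outgoing edges of node p
            if ( !seen[t] )  next_path(ns, t);
        }
        seen[p] = 0;
    }
};

// Check that all cyclic windows of length ld are distinct
// (the string must have length 2^ld).
bool is_debruijn(const std::string &s, ulong ld)
{
    assert( s.size() == (std::size_t(1) << ld) );
    std::set<std::string> w;
    for (std::size_t k = 0; k < s.size(); ++k)
    {
        std::string t;
        for (ulong i = 0; i < ld; ++i)  t += s[(k + i) % s.size()];
        w.insert(t);
    }
    return w.size() == s.size();
}
```

For 16 nodes the search started with next_path(0, 0) collects 16 strings, in agreement with the count 2^(2^(m-1) - m) of De Bruijn sequences of length 2^m for m = 4. The same formula gives the 67,108,864 sequences of length 64 quoted above.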


Figure 20.2-C: A path in the De Bruijn graph with 64 nodes. Each binary word is printed vertically,
the symbols ‘#’ and ‘-’ stand for one and zero, respectively.


Setting the argument pq to 4 prints the binary values of the successive nodes in the path horizontally, see
figure 20.2-C. The graph is constructed in a way that each word is the predecessor shifted by one with
either zero or one inserted at position zero (top row of figure 20.2-C).


The number of cycles in the De Bruijn graph equals the number of degree-n normal binary polynomials,
see section 42.6.3 on page 904. A closed form for the special case n = 2^k is given in section on page


20.2.3 A modified De Bruijn graph: complement-shift sequences 


Figure 20.2-D: A path in the modified De Bruijn graph with 64 nodes. Each binary word is printed
vertically, the symbols ‘#’ and ‘-’ stand for one and zero, respectively.


A modification of the De Bruijn graph forces each node to be the complement of its predecessor shifted
by one (again with either zero or one inserted at position zero). The routine to set up the graph is [FXT:
graph/mk-debruijn-digraph.cc]:

    digraph
    make_complement_shift_digraph(ulong n)
    {
        ulong ng = 2*n,  ne = 2*ng;
        ulong *ep, *e;
        digraph dg(ng, ne, ep, e);

        ulong j = 0;
        for (ulong k=0; k<ng; ++k)  // for all nodes
        {
            ep[k] = j;
            ulong r = (2*k) % ng;
            e[j++] = r;  // connect node k to node (2*k) mod ng
            r = (2*k+1) % ng;
            e[j++] = r;  // connect node k to node (2*k+1) mod ng
        }
        ep[ng] = j;
        // Here we have a De Bruijn graph.

        for (ulong k=0,j=ng-1; k<j; ++k,--j)  swap2(e[ep[k]], e[ep[j]]);  // end with ones
        for (ulong k=0,j=ng-1; k<j; ++k,--j)  swap2(e[ep[k]+1], e[ep[j]+1]);

        return dg;
    }


The output of the program [FXT: graph/graph-complementshift-demo.cc] is shown in figure 20.2-D.


For n a power of 2 the sequence of binary words has the interesting property that the changes between
successive words depend on their sequency: words with higher sequency change in fewer positions. Further,
if two adjacent bits are set in some word, then the next word never has both bits set again. Out of a run
of k ≥ 2 consecutive set bits in a word only one is contained in the next word.


See section 8.3 on page 208 for the connection with De Bruijn sequences.

20.3 Conditional search 


Sometimes one wants to find paths that are subject to certain restrictions. Testing for each path found
whether it has the desired property and discarding it if not is the simplest way. However, this will in
many cases be extremely inefficient. An upper bound for the number of recursive calls of the search
function next_path() with a graph with N nodes and a maximal number of v outgoing edges at each
node is u = N^v.


For example, the graph corresponding to Gray codes of n-bit binary words has N = 2^n nodes and
(exactly) v = n outgoing edges at each node. The graph is the n-dimensional hypercube.


     n:      N    u = N^v = N^n = 2^(n^2)
     1:      2    2
     2:      4    16
     3:      8    512
     4:     16    65,536
     5:     32    33,554,432
     6:     64    68,719,476,736
     7:    128    562,949,953,421,312
     8:    256    18,446,744,073,709,551,616
     9:    512    2,417,851,639,229,258,349,412,352
    10:   1024    1,267,650,600,228,229,401,496,703,205,376


To reduce the search space we use a function that rejects branches that would lead to a path not
satisfying the imposed restrictions. A conditional search can be started via all_cond_paths() that has
an additional function pointer cfunc() as argument. The function must implement the condition. The
corresponding method is declared as [FXT: graph/digraph-paths.h]:

    bool (*cfunc_)(digraph_paths &, ulong ns);

Besides the data from the digraph-class it needs the number of nodes seen so far (ns) as an argument. A
slight modification of the search routine does what we want [FXT: graph/search-digraph-cond.cc]:


    void
    digraph_paths::next_cond_path(ulong ns, ulong p)
    {
      [--snip--]  // same as next_path()
        if ( ns==ng_ )  // all nodes seen ?
      [--snip--]  // same as next_path()
        else
        {
            qq_[p] = 1;  // mark position as seen (else loops lead to errors)
            ulong fe, en;
            g_.get_edge_idx(p, fe, en);
            ulong fct = 0;  // count free reachable nodes
            for (ulong ep=fe; ep<en; ++ep)
            {
                ulong t = g_.e_[ep];  // next node
                if ( 0==qq_[t] )  // node free?
                {
                    rv_[ns] = t;  // for cfunc()
                    if ( cfunc_(*this, ns) )
                    {
                        ++fct;
                        qq_[p] = fct;  // mark position as seen: record turns
                        next_cond_path(ns, t);
                    }
                }
            }
            qq_[p] = 0;  // unmark position
        }
    }


The free node under consideration is written to the end of the record of visited nodes so cfunc() does 
not need it as an explicit argument. 
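
The pruning idea can be shown in a compact, self-contained form. The sketch below is not the FXT code: the struct ac_gray_search is a name chosen for this illustration, the hypercube edges are generated on the fly, and the condition used as an example requires that the bit changed at one step is adjacent to the bit changed at the step before. The search stops at the first full path found.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

struct ac_gray_search
{
    ulong n, ng;            // number of bits, number of nodes 2^n
    std::vector<ulong> rv;  // record of visits
    std::vector<char> seen; // tag array
    bool found;             // set when a full path has been recorded in rv

    explicit ac_gray_search(ulong nbits)
        : n(nbits), ng(1UL << nbits), rv(ng), seen(ng, 0), found(false)  { }

    // Condition: the bits changed by the last two steps are adjacent.
    bool cond(ulong ns) const
    {
        if ( ns < 2 )  return true;
        ulong c = rv[ns] ^ rv[ns-1],  c1 = rv[ns-1] ^ rv[ns-2];
        return (c & (c1 << 1)) || (c1 & (c << 1));
    }

    void next_cond_path(ulong ns, ulong p)
    {
        assert( p < ng );
        if ( found )  return;
        rv[ns] = p;  ++ns;
        if ( ns == ng )  { found = true;  return; }
        seen[p] = 1;
        for (ulong b = 0; b < n && !found; ++b)
        {
            ulong t = p ^ (1UL << b);  // hypercube neighbor
            if ( seen[t] )  continue;
            rv[ns] = t;                // tentative entry, read by cond()
            if ( cond(ns) )  next_cond_path(ns, t);  // else: branch rejected
        }
        seen[p] = 0;
    }
};
```

After the call next_cond_path(0, 0) on an object constructed with 4 bits, the array rv holds a Gray code in which successive changes happen in adjacent bits. Branches violating the condition are cut off as soon as the offending node is appended, not after a full path has been built.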


20.3.1 Modular adjacent changes (MAC) Gray codes 


     0:  ....  0   0   ...1  0         0:  ....  0   0   ...1  0
     1:  ...1  1   1   ..1.  1         1:  ...1  1   1   ..1.  1
     2:  ..11  2   3   .1..  2         2:  ..11  2   3   .1..  2
     3:  .111  3   7   1...  3         3:  .111  3   7   ..1.  1
     4:  1111  4  15   ...1  0         4:  .1.1  2   5   ...1  0
     5:  111.  3  14   ..1.  1         5:  .1..  1   4   ..1.  1
     6:  11..  2  12   ...1  0         6:  .11.  2   6   .1..  2
     7:  11.1  3  13   1...  3         7:  ..1.  1   2   1...  3
     8:  .1.1  2   5   ...1  0         8:  1.1.  2  10   .1..  2
     9:  .1..  1   4   ..1.  1         9:  111.  3  14   ..1.  1
    10:  .11.  2   6   .1..  2        10:  11..  2  12   ...1  0
    11:  ..1.  1   2   1...  3        11:  11.1  3  13   ..1.  1
    12:  1.1.  2  10   ...1  0        12:  1111  4  15   .1..  2
    13:  1.11  3  11   ..1.  1        13:  1.11  3  11   ..1.  1
    14:  1..1  2   9   ...1  0        14:  1..1  2   9   ...1  0
    15:  1...  1   8  [1...  3]       15:  1...  1   8  [1...  3]


Figure 20.3-A: Two 4-bit modular adjacent changes (MAC) Gray codes. Both are cycles. 


We search for Gray codes that have the modular adjacent changes (MAC) property: the values of
successive elements of the delta sequence can only change by ±1 modulo n. Two examples are shown in
figure 20.3-A. The sequence on the right side even has the stated property if the term ‘modular’ is
omitted: It has the adjacent changes (AC) property.
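
Both properties are easy to test for a given delta sequence. The two functions below are a sketch written for this purpose (the names is_mac() and is_ac() are not from FXT); a delta sequence is given as the list of bit positions that change from one word to the next.

```cpp
#include <cassert>
#include <vector>

// True if successive delta values differ by +-1 modulo n (MAC property).
bool is_mac(const std::vector<unsigned> &d, unsigned n)
{
    for (std::size_t k = 1; k < d.size(); ++k)
    {
        assert( d[k] < n && d[k-1] < n );
        unsigned up = (d[k-1] + 1) % n,  dn = (d[k-1] + n - 1) % n;
        if ( d[k] != up && d[k] != dn )  return false;
    }
    return true;
}

// True if successive delta values differ by +-1 without wrap-around (AC property).
bool is_ac(const std::vector<unsigned> &d)
{
    for (std::size_t k = 1; k < d.size(); ++k)
        if ( d[k] + 1 != d[k-1] && d[k-1] + 1 != d[k] )  return false;
    return true;
}
```

The left code of figure 20.3-A has the delta sequence 0, 1, 2, 3, 0, 1, 0, 3, 0, 1, 2, 3, 0, 1, 0: it passes is_mac() with n = 4 but fails is_ac() because of the step from 3 to 0. The right code has the delta sequence 0, 1, 2, 1, 0, 1, 2, 3, 2, 1, 0, 1, 2, 1, 0 and passes both tests.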


As bit-wise cyclic shifts and reflections of MAC Gray codes are again MAC Gray codes we consider paths
starting 0 → 1 → 3 as canonical paths.


In the demo [FXT: graph/graph-macgray-demo.cc] the search is done as follows (shortened):


    int main(int argc, char **argv)
    {
        ulong n = 5;
        NXARG(n, "size in bits");
        cf_nb = n;

        digraph dg = make_gray_digraph(n, 0);
        digraph_paths dp(dg);

        ulong ns = 0, p = 0;
        // MAC: canonical paths start as 0-->1-->3
        {
            dp.mark(0, ns);
            dp.mark(1, ns);
            p = 3;
        }

        dp.all_cond_paths(pfunc, cfunc_mac, ns, p, maxnp);
        return 0;
    }


The function used to impose the MAC condition is: 


    ulong cf_nb;  // number of bits, set in main()
    bool cfunc_mac(digraph_paths &dp, ulong ns)
    // Condition: difference of successive delta values (modulo n) == +-1
    {
        // path initialized, we have ns>=2
        ulong p = dp.rv_[ns],  p1 = dp.rv_[ns-1],  p2 = dp.rv_[ns-2];
        ulong c = p ^ p1,  c1 = p1 ^ p2;
        if ( c & bit_rotate_left(c1,1,cf_nb) )  return true;
        if ( c1 & bit_rotate_left(c,1,cf_nb) )  return true;
        return false;
    }


We find paths for n ≤ 7 (n = 7 takes about 15 minutes). Whether MAC Gray codes exist for n ≥ 8 is
unknown (none is found with a 40 hour search).


20.3.2 Adjacent changes (AC) Gray codes 


For AC paths we can only discard track-reflected solutions, the canonical paths are those where the delta
sequence starts with a value ≤ ⌈n/2⌉. A function to impose the AC condition is

    ulong cf_mt;  // mid track < cf_mt, set in main()
    bool cfunc_ac(digraph_paths &dp, ulong ns)
    // Condition: difference of successive delta values == +-1
    {
        if ( ns<2 )  return (dp.rv_[1] < cf_mt);  // avoid track-reflected solutions
        ulong p = dp.rv_[ns],  p1 = dp.rv_[ns-1],  p2 = dp.rv_[ns-2];
        ulong c = p ^ p1,  c1 = p1 ^ p2;
        if ( c & (c1<<1) )  return true;
        if ( c1 & (c<<1) )  return true;
        return false;
    }

     0:  .....  0   0   ..1..  2         0:  .....  0   0   ....1  0
     1:  ..1..  1   4   .1...  3         1:  ....1  1   1   ...1.  1
     2:  .11..  2  12   1....  4         2:  ...11  2   3   ..1..  2
     3:  111..  3  28   .1...  3         3:  ..111  3   7   .1...  3
     4:  1.1..  2  20   ..1..  2         4:  .1111  4  15   ..1..  2
     5:  1....  1  16   ...1.  1         5:  .1.11  3  11   ...1.  1
     6:  1..1.  2  18   ..1..  2         6:  .1..1  2   9   ..1..  2
     7:  1.11.  3  22   .1...  3         7:  .11.1  3  13   .1...  3
     8:  1111.  4  30   ..1..  2         8:  ..1.1  2   5   1....  4
     9:  11.1.  3  26   ...1.  1         9:  1.1.1  3  21   .1...  3
    10:  11...  2  24   ....1  0        10:  111.1  4  29   ..1..  2
    11:  11..1  3  25   ...1.  1        11:  11..1  3  25   ...1.  1
    12:  11.11  4  27   ..1..  2        12:  11.11  4  27   ..1..  2
    13:  11111  5  31   .1...  3        13:  11111  5  31   .1...  3
    14:  1.111  4  23   ..1..  2        14:  1.111  4  23   ..1..  2
    15:  1..11  3  19   ...1.  1        15:  1..11  3  19   ...1.  1
    16:  1...1  2  17   ..1..  2        16:  1...1  2  17   ....1  0
    17:  1.1.1  3  21   .1...  3        17:  1....  1  16   ...1.  1
    18:  111.1  4  29   1....  4        18:  1..1.  2  18   ..1..  2
    19:  .11.1  3  13   .1...  3        19:  1.11.  3  22   .1...  3
    20:  ..1.1  2   5   ..1..  2        20:  1111.  4  30   ..1..  2
    21:  ....1  1   1   ...1.  1        21:  11.1.  3  26   ...1.  1
    22:  ...11  2   3   ..1..  2        22:  11...  2  24   ..1..  2
    23:  ..111  3   7   .1...  3        23:  111..  3  28   .1...  3
    24:  .1111  4  15   ..1..  2        24:  1.1..  2  20   1....  4
    25:  .1.11  3  11   ...1.  1        25:  ..1..  1   4   .1...  3
    26:  .1..1  2   9   ....1  0        26:  .11..  2  12   ..1..  2
    27:  .1...  1   8   ...1.  1        27:  .1...  1   8   ...1.  1
    28:  .1.1.  2  10   ..1..  2        28:  .1.1.  2  10   ..1..  2
    29:  .111.  3  14   .1...  3        29:  .111.  3  14   .1...  3
    30:  ..11.  2   6   ..1..  2        30:  ..11.  2   6   ..1..  2
    31:  ...1.  1   2  [...1.  1]       31:  ...1.  1   2  [...1.  1]


Figure 20.3-B: Two 5-bit adjacent changes (AC) Gray codes that are cycles. 


The program [FXT: graph/graph-acgray-demo.cc] allows searches for AC Gray codes. Two cycles for
n = 5 are shown in figure 20.3-B. It turns out that such paths exist for n ≤ 6 (the only path for n = 6 is
shown in figure 20.3-C) but there is no AC Gray code for n = 7:

    time ./bin 7
    arg 1: 7 == n  [size in bits]  default=5
    arg 2: 0 == maxnp  [stop after maxnp paths (0: never stop)]  default=0
    n = 7   #pfct = 0
    #paths = 0   #cycles = 0
    ./bin 7  20.77s user 0.11s system 98% cpu 21.232 total


Nothing is known about the case n ≥ 8. For n = 8 no path is found within 15 days.

By inspection of the AC Gray codes for different values of n we find an ad hoc algorithm. The following
routine computes the delta sequence for AC Gray codes for n ≤ 6 [FXT: comb/acgray.cc]:


    void
    ac_gray_delta(uchar *d, ulong ldn)
    // Generate a delta sequence for an adjacent-changes (AC) Gray code
    // of length n=2**ldn where ldn<=6.
    {
        if ( ldn<=2 )  // standard Gray code
        {
            d[0] = 0;
            if ( ldn==2 )  { d[1] = 1;  d[2] = 0; }
            return;
        }

        ac_gray_delta(d, ldn-1);  // recursion

        ulong n = 1UL<<ldn;
        ulong nh = n/2;
        if ( 0==(ldn&1) )
        {
            if ( ldn>=6 )
            {
                reverse(d, nh-1);
                for (ulong k=0; k<nh; ++k)  d[k] = (ldn-2) - d[k];
            }

            for (ulong k=0,j=n-2; k<j; ++k,--j)  d[j] = d[k];
            d[nh-1] = ldn - 1;
        }
        else
        {
            for (ulong k=nh-2,j=nh-1;  0!=j;  --k,--j)  d[j] = d[k] + 1;
            for (ulong k=2,j=n-2;  k<j;  ++k,--j)  d[j] = d[k];
            d[0] = 0;
            d[nh] = 0;
        }
    }


Figure 20.3-C: The (essentially unique) AC Gray code for n = 6. While the path is a cycle in the
graph, the AC condition does not hold for the transition from the last to the first word.


The program [FXT: comb/acgray-demo.cc] can be used to create AC Gray codes for n ≤ 6. For n ≥ 7
the algorithm produces near-AC Gray codes, where the number of non-AC transitions equals 2^(n-5) - 1
for odd values of n and 2^(n-5) - 2 for n even:

    n = 0..6   #non-ac = 0
  [--snip--]
    n = 12   #non-ac = 126


Near-AC Gray codes with fewer non-AC transitions may exist. 
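
The claim for n ≤ 6 can be checked mechanically. The following sketch restates the recursion for the delta sequence in self-contained form and adds a checker; std::reverse stands in for the FXT routine reverse(), and the function is_ac_gray() is a name introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef unsigned long ulong;
typedef unsigned char uchar;

// Delta sequence (2^ldn - 1 entries) for an AC Gray code, ldn <= 6.
void ac_gray_delta(uchar *d, ulong ldn)
{
    if ( ldn <= 2 )  // standard Gray code
    {
        d[0] = 0;
        if ( ldn == 2 )  { d[1] = 1;  d[2] = 0; }
        return;
    }
    ac_gray_delta(d, ldn-1);  // recursion
    ulong n = 1UL << ldn,  nh = n/2;
    if ( 0 == (ldn & 1) )
    {
        if ( ldn >= 6 )
        {
            std::reverse(d, d + (nh-1));  // assumption: same effect as FXT's reverse(d, nh-1)
            for (ulong k = 0; k < nh; ++k)  d[k] = (ldn-2) - d[k];
        }
        for (ulong k = 0, j = n-2; k < j; ++k, --j)  d[j] = d[k];
        d[nh-1] = ldn - 1;
    }
    else
    {
        for (ulong k = nh-2, j = nh-1; 0 != j; --k, --j)  d[j] = d[k] + 1;
        for (ulong k = 2, j = n-2; k < j; ++k, --j)  d[j] = d[k];
        d[0] = 0;
        d[nh] = 0;
    }
}

// Walk the delta sequence from word 0. Return true if all 2^ldn words are
// visited exactly once and successive delta values differ by exactly one.
bool is_ac_gray(ulong ldn)
{
    assert( ldn >= 1 );
    ulong n = 1UL << ldn;
    std::vector<uchar> d(n, 0);  // zero-filled; n-1 entries are used
    ac_gray_delta(&d[0], ldn);
    std::vector<char> seen(n, 0);
    ulong w = 0;
    seen[0] = 1;
    for (ulong k = 0; k + 1 < n; ++k)
    {
        if ( k > 0 && d[k] + 1 != d[k-1] && d[k-1] + 1 != d[k] )  return false;
        w ^= 1UL << d[k];
        if ( w >= n || seen[w] )  return false;
        seen[w] = 1;
    }
    return true;
}
```

For ldn = 4 and ldn = 5 the recursion reproduces the delta sequences of the right-hand codes in figures 20.3-A and 20.3-B. The array is zero-filled before the call so that the entry d[nh-1], which is touched before it is set, never holds an indeterminate value.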


20.4 Edge sorting and lucky paths 


The order of the nodes in the representation of the graph does not matter when finding paths as the
algorithm at no point refers to it. The order of the outgoing edges, however, does matter. 


20.4.1 Edge sorting 


Consider a large graph that has only a few paths. The calling tree of the recursive function next_path() 
obviously depends on the edge order. Therefore the first path can appear earlier or later in the search. 
‘Later’ may well mean that the path is not found within any reasonable amount of time. 


With a bit of luck one might find an ordering of the edges of the graph that will shorten the time until
the first path is found. The program [FXT: graph/graph-monotonicgray-demo.cc] searches for monotonic
Gray codes and optionally sorts the edges of the graph. The following method sorts the outgoing edges
of each node according to a supplied comparison function [FXT: graph/digraph.cc]:

    void
    digraph::sort_edges(int (*cmp)(const ulong &, const ulong &))
    {
        if ( 0==vn_ )  // value == index (in e[])
        {
            for (ulong k=0; k<ng_; ++k)
            {
                ulong x = ep_[k];
                ulong n = ep_[k+1] - x;
                selection_sort(e_+x, n, cmp);
            }
        }
        else  // values in vn[]
        {
            for (ulong k=0; k<ng_; ++k)
            {
                ulong x = ep_[k];
                ulong n = ep_[k+1] - x;
                idx_selection_sort(vn_, n, e_+x, cmp);
            }
        }
    }


The comparison function actually used imposes the lexicographic order shown in section 1.26 on page 70:

    int my_cmp(const ulong &a, const ulong &b)
    {
        if ( a==b )  return 0;
    #define CODE(x) lexrev2negidx(x);
        ulong ca = CODE(a);
        ulong cb = CODE(b);
        return (ca<cb ? +1 : -1);
    }



The choice was inspired by the observation that the bit-wise difference of successive elements in bit-lex 
order is either one or three. We search until the first path for 8-bit words is found: for the unsorted graph 
this task takes 1.14 seconds, for the sorted it takes 0.03 seconds. 
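
The per-node sorting step can be sketched with the standard library. This is an illustration, not the FXT method: std::sort replaces the selection sort, the graph is given by the two arrays ep and e as plain vectors, and the ordering fewer_bits_first() is an arbitrary example chosen here (it is not the order used in the demo).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Sort the targets of each node k, stored in e[ep[k]], ..., e[ep[k+1]-1],
// by the strict weak ordering 'before'. Edges never move between nodes.
void sort_edges(const std::vector<ulong> &ep, std::vector<ulong> &e,
                bool (*before)(ulong, ulong))
{
    assert( !ep.empty() && ep.back() == e.size() );
    for (std::size_t k = 0; k + 1 < ep.size(); ++k)
        std::sort(e.begin() + ep[k], e.begin() + ep[k+1], before);
}

// Example ordering (an assumption for illustration): targets with fewer
// set bits come first, ties are broken by value.
bool fewer_bits_first(ulong a, ulong b)
{
    ulong ca = 0, cb = 0;
    for (ulong x = a; x; x &= x-1)  ++ca;  // count set bits of a
    for (ulong x = b; x; x &= x-1)  ++cb;  // count set bits of b
    return ca != cb ? ca < cb : a < b;
}
```

Only the order within each node's edge range changes, so the set of paths is the same as before. What changes is the order in which the recursive search meets them.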


20.4.2 Lucky paths 


The first Gray code found in the hypercube graph with randomized edge order is shown in figure 20.4-A
(left). The corresponding path, as reported by the method digraph_paths::print_turns() [FXT:
graph/digraph-paths.cc], is described in the right column. Here nn is the number of neighbors of node,
xe is the index of the neighbor (next) in the list of edges of node. Finally xf is the index among the
free nodes in the list. The latter corresponds to the value fct-1 in the function next_path() given in
section 20.2 on page 393.


     0:  ....  0   0   1...  3       Step:   node -> next   [xf  xe / nn]
     1:  1...  1   8   ..1.  1          0:      0 ->  8     [ 0   0 /  4]
     2:  1.1.  2  10   .1..  2          1:      8 -> 10     [ 0   0 /  4]
     3:  111.  3  14   ...1  0          2:     10 -> 14     [ 0   0 /  4]
     4:  1111  4  15   1...  3          3:     14 -> 15     [ 0   0 /  4]
     5:  .111  3   7   .1..  2          4:     15 ->  7     [ 0   0 /  4]
     6:  ..11  2   3   ...1  0          5:      7 ->  3     [ 0   1 /  4]
     7:  ..1.  1   2   .1..  2          6:      3 ->  2     [ 1   2 /  4]
     8:  .11.  2   6   ..1.  1          7:      2 ->  6     [ 0   3 /  4]
     9:  .1..  1   4   1...  3          8:      6 ->  4     [ 0   0 /  4]
    10:  11..  2  12   ...1  0          9:      4 -> 12     [ 1   3 /  4]
    11:  11.1  3  13   1...  3         10:     12 -> 13     [ 0   0 /  4]
    12:  .1.1  2   5   .1..  2         11:     13 ->  5     [ 0   1 /  4]
    13:  ...1  1   1   1...  3         12:      5 ->  1     [ 0   3 /  4]
    14:  1..1  2   9   ..1.  1         13:      1 ->  9     [ 0   2 /  4]
    15:  1.11  3  11  [1.11  -]        14:      9 -> 11     [ 0   0 /  4]
                                     Path:  #non-first-free turns = 2


Figure 20.4-A: A Gray code in the hypercube graph with randomized edge order (left) and the path 
description (right, see text). 


If xf equals zero at some step, the first free neighbor was visited. If xf is nonzero, a dead end was reached 
in the course of the search and there was at least one U-turn. If the path is not the first found, the U-turn 
might well correspond to a previous path. 


If there was no U-turn, the number of non-first-free turns is zero (the number is given as the last line of 
the report). If it is zero, we call the path found a lucky path. For each given ordering of the edges and 
each starting position of the search there is at most one lucky path and if there is, it is the first path 
found. 


If the first path is a lucky path, the search effectively ‘falls through’: the number of operations is a 
constant times the number of edges. That is, if a lucky path exists it is found almost immediately even 
for huge graphs. 


20.5 Gray codes for Lyndon words 


We search Gray codes for n-bit binary Lyndon words where n is a prime. Here is a Gray code for the
5-bit Lyndon words that is a cycle:

    ....1
    ...11
    .1.11
    .1111
    ..111
    ..1.1

An important application of such Gray codes is the construction of single track Gray codes which can be 
obtained by appending rotated versions of the block. The following is a single track Gray code based on 
the block given. At each stage, the block is rotated by two positions (horizontal format): 


    ######  --##--  -####-  ------  ---###
    -####-  ------  ---###  ######  --##--
    ---###  ######  --##--  -####-  ------
    --##--  -####-  ------  ---###  ######
    ------  ---###  ######  --##--  -####-


The transition count (the number of zero-one transitions) is by construction the same for each track. The
all-zero and the all-one words are missing in the Gray code, its length equals $2^n - 2$.

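The construction can be checked mechanically. The following is a minimal C++ sketch, not taken from FXT: the helper names rotl(), single_track(), is_gray_cycle() and all_distinct() are hypothetical, and the rotation direction is an assumption (either direction gives a valid code for this block). It appends rotated copies of a block and tests the Gray property cyclically.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Rotate the low n bits of x to the left by s positions (hypothetical helper).
unsigned rotl(unsigned x, unsigned s, unsigned n)
{
    s %= n;
    const unsigned mask = (1u << n) - 1;
    return ((x << s) | (x >> (n - s))) & mask;
}

// Append n copies of the block; copy number 'stage' is rotated by stage*step.
std::vector<unsigned> single_track(const std::vector<unsigned> &block,
                                   unsigned n, unsigned step)
{
    std::vector<unsigned> code;
    for (unsigned stage = 0; stage < n; ++stage)
        for (unsigned w : block)
            code.push_back(rotl(w, stage * step, n));
    return code;
}

// True if cyclically adjacent words differ in exactly one bit.
bool is_gray_cycle(const std::vector<unsigned> &code)
{
    for (std::size_t k = 0; k < code.size(); ++k)
    {
        const unsigned d = code[k] ^ code[(k + 1) % code.size()];
        if (d == 0 || (d & (d - 1)) != 0)  return false;
    }
    return true;
}

// True if no word occurs twice.
bool all_distinct(const std::vector<unsigned> &code)
{
    return std::set<unsigned>(code.begin(), code.end()).size() == code.size();
}
```

Called with the 5-bit block {1, 3, 11, 15, 7, 5} (the cycle through the Lyndon words given above, as integers), n = 5 and step = 2, the result has $2^5 - 2 = 30$ words.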
20.5.1 Graph search with edge sorting 


Gray codes for the 7-bit binary Lyndon words like those shown in figure 20.5-A can easily be found by a
graph search. In fact, all of them can be generated in a short time: for n = 7 there are 395 Gray codes
(starting with the word 0000...001) of which 112 are cycles.


The search for such a path for the next prime, n = 11, does not seem to give a result in reasonable time. 


[Figure content not recoverable: six columns, each listing the 18 words (rows 0 to 17) of one Gray code through the length-7 binary Lyndon words.]


Figure 20.5-A: Various Gray codes through the length-7 binary Lyndon words. The first four are cycles. 


  k:  [node]  lyn_dec  lyn_bin  #rot  rot(lyn)  delta
  0:  [  0]       1    ......1    0   ......1     0
  1:  [  1]       3    .....11    0   .....11     1
  2:  [  3]       7    ....111    0   ....111     2
  3:  [  7]      15    ...1111    0   ...1111     3
  4:  [ 13]      31    ..11111    0   ..11111     4
  5:  [ 17]      63    .111111    0   .111111     5
  6:  [ 15]      47    .1.1111    0   .1.1111     4
  7:  [ 10]      23    ..1.111    1   .1.111.     0
  8:  [ 16]      55    .11.111    1   11.111.     6
  9:  [ 11]      27    ..11.11    2   11.11..     1
 10:  [  5]      11    ...1.11    2   .1.11..     6
 11:  [ 14]      43    .1.1.11    2   .1.11.1     0
 12:  [  6]      13    ...11.1    0   ...11.1     5
 13:  [ 12]      29    ..111.1    0   ..111.1     4
 14:  [  8]      19    ..1..11    3   ..11..1     2
 15:  [  4]       9    ...1..1    0   ...1..1     4
 16:  [  9]      21    ..1.1.1    3   .1.1..1     5
 17:  [  2]       5    ....1.1    3   .1.1...     0


Figure 20.5-B: A Gray code through the length-7 binary Lyndon words. 


If we do not insist on a Gray code through the cyclic minima, but allow for arbitrary rotations of the
Lyndon words, then more Gray codes exist. For that purpose nodes are declared adjacent if there is any
cyclic rotation of the second node's value that differs in exactly one bit from the first node's value. The
cyclic rotations can be recovered easily after a path is found. This is done in [FXT: graph/graph-lyndon-…],
whose output is shown in figure 20.5-B. Still, already for n = 11 we do not get a result.
As the corresponding graph has 186 nodes and 1954 edges, this is not a surprise.


Now we sort the edges according to the comparison function [FXT: graph/lyndon-cmp.cc]:

    int lyndon_cmp0(const ulong &a, const ulong &b)
    {
        int bc = bit_count_cmp(a, b);
        if ( bc )  return -bc;  // more bits first
        else
        {
            if ( a==b )  return 0;
            return (a>b ? +1 : -1);  // greater numbers last
        }
    }


where bit_count_cmp() is defined in [FXT: bits/bitcount.h]:

    static inline int bit_count_cmp(const ulong &a, const ulong &b)
    {
        ulong ca = bit_count(a);
        ulong cb = bit_count(b);
        return ( ca==cb ? 0 : (ca>cb ? +1 : -1) );
    }
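To illustrate the resulting order, here is a small self-contained C++ sketch. It repeats the two comparison functions from above; bit_count() is a portable stand-in for the FXT routine, and sort_neighbors() is a hypothetical helper (not part of FXT) that sorts a list of candidate nodes so that the preferred node comes first.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Stand-in for FXT's bit_count(): number of set bits (portable loop).
static inline ulong bit_count(ulong x)
{
    ulong c = 0;
    while ( x )  { x &= x - 1;  ++c; }
    return c;
}

static inline int bit_count_cmp(const ulong &a, const ulong &b)
{
    ulong ca = bit_count(a);
    ulong cb = bit_count(b);
    return ( ca==cb ? 0 : (ca>cb ? +1 : -1) );
}

int lyndon_cmp0(const ulong &a, const ulong &b)
{
    int bc = bit_count_cmp(a, b);
    if ( bc )  return -bc;  // more bits first
    if ( a==b )  return 0;
    return (a>b ? +1 : -1);  // greater numbers last
}

// Hypothetical helper: sort so that the preferred neighbor comes first.
void sort_neighbors(std::vector<ulong> &v)
{
    std::sort(v.begin(), v.end(),
              [](ulong a, ulong b) { return lyndon_cmp0(a, b) < 0; });
}
```

For the words 1, 3, 5, 7, 9 the order becomes 7, 3, 5, 9, 1: the word with three set bits first, then the two-bit words in increasing order, then the one-bit word.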


We find a Gray code (which also is a cycle) for n = 11 immediately. Same for n = 13, again a cycle. The
  k:  [node]  lyn_dec  lyn_bin
  0:  [   0]       1   ............1
  1:  [   1]       3   ...........11
  2:  [   3]       7   ..........111
  3:  [   7]      15   .........1111
  4:  [  15]      31   ........11111
  5:  [  31]      63   .......111111
  6:  [  63]     127   ......1111111
  7:  [ 125]     255   .....11111111
  8:  [ 239]     511   ....111111111
  9:  [ 417]    1023   ...1111111111
 10:  [ 589]    2047   ..11111111111
 11:  [ 629]    4095   .111111111111
 12:  [ 618]    3071   .1.1111111111
 13:  [ 514]    1535   ..1.111111111
 14:  [ 624]    3583   .11.111111111
 15:  [ 550]    1791   ..11.11111111
 16:  [ 626]    3839   .111.11111111
 17:  [ 567]    1919   ..111.1111111
 18:  [ 627]    3967   .1111.1111111
 19:  [ 576]    1983   ..1111.111111
 20:  [ 628]    4031   .11111.111111
 21:  [ 581]    2015   ..11111.11111
 22:  [ 404]     991   ...1111.11111
 23:  [ 614]    3039   .1.1111.11111
 24:  [ 508]    1519   ..1.1111.1111
 25:  [ 584]    2031   ..111111.1111
 [--snip--]
615:  [   4]       9   .........1..1
616:  [  36]      73   ......1..1..1
617:  [  32]      65   ......1.....1
618:  [  33]      67   ......1....11
619:  [ 153]     323   ....1.1....11
620:  [  65]     133   .....1....1.1
621:  [ 154]     325   ....1.1...1.1
622:  [  79]     161   .....1.1....1
623:  [  16]      33   .......1....1
624:  [ 126]     265   ....1....1..1
625:  [ 145]     305   ....1..11...1
626:  [ 130]     273   ....1...1...1
627:  [ 188]     401   ....11..1...1
628:  [  71]     145   .....1..1...1
629:  [   8]      17   ........1...1

Figure 20.5-C: Begin and end of a Gray cycle through the 13-bit binary Lyndon words.


graph for n = 13 has 630 nodes and 8,056 edges, so finding a path is quite unexpected. The cycle found
starts and ends as shown in figure 20.5-C.


For the next candidate (n = 17) we do not find a Gray code within many hours of search. No surprise for
a graph with 7,710 nodes and 130,828 edges. We try another edge sorting scheme, an ordering based on
the binary Gray code [FXT: graph/lyndon-cmp.cc]:

    int lyndon_cmp2(const ulong &a, const ulong &b)
    {
        if ( a==b )  return 0;
    #define CODE(x) gray_code(x)
        ulong ta = CODE(a), tb = CODE(b);
        return ( ta<tb ? +1 : -1);
    }


We find a Gray code for n = 17 and for all smaller primes. All are cycles and all paths are lucky paths. The
following edge sorting scheme also leads to Gray codes for all prime n where $3 \le n \le 17$:

    int lyndon_cmp3(const ulong &a, const ulong &b)
    {
        if ( a==b )  return 0;
    #define CODE(x) inverse_gray_code(x)
        ulong ta = CODE(a), tb = CODE(b);
        return ( ta<tb ? +1 : -1);
    }


Same for n = 19, the graph has 27,594 nodes and 523,978 edges. Indeed the sorting scheme leads to 
cycles for all odd n < 27. All these paths are lucky paths, a fact that we can exploit. 
20.5.2 An optimized algorithm


  n | number of nodes | tag-size | time    ||  n | number of nodes | tag-size | time
 23 |         364,722 |  0.25 MB |  1 sec  || 35 |     981,706,830 |     1 GB |    1 h
 25 |       1,342,182 |     1 MB |  3 sec  || 37 |   3,714,566,310 |     4 GB |    7 h
 27 |       4,971,066 |     4 MB | 12 sec  || 39 |  14,096,303,342 |    16 GB |    2 d
 29 |      18,512,790 |    16 MB |  1 min  || 41 |  53,634,713,550 |    64 GB |   10 d
 31 |      69,273,666 |    64 MB |  4 min  || 43 | 204,560,302,842 |   256 GB |  >40 d
 33 |     260,301,174 |   256 MB | 16 min  || 45 | 781,874,934,568 |     1 TB | >160 d

Figure 20.5-D: Memory and (approximate) time needed for computing Gray codes with n-bit Lyndon
words. The number of nodes equals the number of length-n necklaces minus 2. The size of the tag array
equals $2^n/4$ bits or $2^n/32$ bytes.


With edge sorting functions that lead to a lucky path we can discard most of the data used with graph
searching. We only need to keep track of whether a node has been visited so far. A tag-array ([FXT:
ds/bitarray.h], see section 4.6 on page 164) suffices.
With n-bit Lyndon words the number of tag-bits needed is $2^n$. An implementation of the algorithm
is given as [FXT: class lyndon_gray in graph/lyndon-gray.h].


If only the cyclic minima of the values are tagged, then only $2^n/2$ bits are needed if the access to the
single necklace consisting of all ones is treated separately. This variant of the algorithm is activated by
uncommenting the line #define ALT_ALGORITHM. As the lowest bit in a necklace is always one, we need
only $2^n/4$ bits: simply shift the words to the right by one position before testing or writing to the tag
array. This can be activated by additionally uncommenting the line #define ALTALT in the file.


When a node is visited, the algorithm creates a table of neighbors and selects the minimum among the 
free nodes with respect to the edge sorting function used. Then the table of neighbors is discarded to 
minimize memory usage. 


If no neighbor is found, the number of nodes visited so far is returned. If this number equals the number 
of n-bit Lyndon words, then a lucky path was found. With composite n a Gray code for n-bit necklaces 
(with the exception of the all-ones and the all-zeros word) will be searched. 
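The steps just described can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the FXT class lyndon_gray: the names cyclic_min(), prefer() and greedy_walk() are hypothetical, nodes are represented by their cyclic minima, one tag bit per n-bit word is used (no memory-saving variant), and the edge order is the one of lyndon_cmp0() (more bits first, then smaller values first).

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Number of set bits (portable loop).
ulong bit_count(ulong x)
{
    ulong c = 0;
    while ( x )  { x &= x - 1;  ++c; }
    return c;
}

// Smallest value among all cyclic rotations of the n-bit word x.
ulong cyclic_min(ulong x, unsigned n)
{
    const ulong mask = (1UL << n) - 1;
    ulong m = x;
    for (unsigned s = 1; s < n; ++s)
    {
        x = ((x << 1) | (x >> (n - 1))) & mask;
        if ( x < m )  m = x;
    }
    return m;
}

// Edge order as in lyndon_cmp0(): true if a is preferred over b.
bool prefer(ulong a, ulong b)
{
    const ulong ca = bit_count(a), cb = bit_count(b);
    if ( ca != cb )  return ca > cb;  // more bits first
    return a < b;                     // greater numbers last
}

// Greedy walk starting at the word 1: always go to the preferred free
// neighbor. Returns the number of nodes visited when no free neighbor is left.
ulong greedy_walk(unsigned n)
{
    const ulong ones = (1UL << n) - 1;
    std::vector<bool> tag(1UL << n, false);  // tag bit: node already visited
    ulong w = 1, ct = 1;
    tag[w] = true;
    while ( true )
    {
        bool found = false;
        ulong best = 0;
        for (unsigned i = 0; i < n; ++i)  // neighbors: flip one bit, normalize
        {
            const ulong m = cyclic_min(w ^ (1UL << i), n);
            if ( m == 0 || m == ones || tag[m] )  continue;
            if ( !found || prefer(m, best) )  { best = m;  found = true; }
        }
        if ( !found )  return ct;
        w = best;
        tag[w] = true;
        ++ct;
    }
}
```

For n = 5 the walk visits 1, 3, 7, 15, 11, 5, that is, all six 5-bit Lyndon words, so a lucky path was found. Whether the walk succeeds for larger n depends on the comparison function, as discussed in the text.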


Four variants of the algorithm have been found so far, corresponding to edge sorting with the 3rd, 5th,
21st, and 29th power of the Gray code. We refer to these functions as comparison functions 0, 1, 2, and
3, respectively. All of these lead to cycles for all primes $n \le 31$. The resources needed with greater values
of n are shown in figure 20.5-D.


Using a 64-bit machine equipped with more than 4 Gigabyte of RAM, it can be verified that three of
the edge sorting functions lead to a Gray cycle also for n = 37, the 3rd power version fails. One of the
sorting functions may lead to a Gray code for n = 41.

A program to compute the Gray codes is [FXT: graph/lyndon-gray-demo.cc], four arguments can be
given:

    arg 1: 13 == n  [a prime < BITS_PER_LONG]  default=17
    arg 2: 1 == wh  [printing: 0==>none, 1==>delta seq., 2==>full output]  default=1
    arg 3: 3 == ncmp  [use comparison function (0,1,2,3)]  default=2
    arg 4: 0 == testall  [special: test all odd values <= value]  default=0

An example with full output is given in figure 20.5-E. A 64-bit CRC (see section 41.3 on page 868) is
computed from the delta sequence (rightmost column) and printed with the last word.


For large n one might want to print only the delta sequence, as shown in figure 20.5-F. The CRC is used
to determine whether two delta sequences are different. Different sequences sometimes start identically.
% ./bin 7 2 0   # 7 bits, full output, comparison function 0
n = 7   #lyn = 18
[--snip: the 18 output rows (lyn_bin, #rot, rot(lyn), diff, delta) are not recoverable--]
last = .....11   crc=0b14a5846c41d57f
n = 7   #lyn = 18   #= 18


Figure 20.5-E: A Gray code for 7-bit Lyndon words. 


% ./bin 13 1 2   # 13 bits, delta seq. output, comparison function 2

n = 13 #lyn = 630 
06B57458354645962546436734A74684A106C0145120825747A745247AC8564567018A7654647484A756A546457CA1ACBC1C 
856BA9A64B97456548645659645219425215315BC82BC75BA02926256354267A462475A3ACB9761560C37412583758CA5624 
B8C6A6C6A87A9C20CBA4534042014540523129075697651563160204230A7BA31C1485C6105201510490BCA891BA9B1B9AC0
A9A89B898A565B8785745865747845A9546702305A41275315458767465747A8457845470379A8586B0A7698578767976759 
A976567686A567656A576B86581305A20AB0ACB0AB53523438235465325247563A432532A372354657643572373624634642
4532397423435235653236423263235234327532342325396926853234232582642436823632346362358423242383242327 
523242325323432642324235323423 

last = ...........11   crc=568dab04b55aa2fb
n = 13   #lyn = 630   #= 630


% ./bin 13 1 3   # 13 bits, delta seq. output, comparison function 3

n = 13 #lyn = 630 
06B57458354645962546436735371CA8B1587BA7610635285A0C2484B9713476B689A897AC98768968B9A106326016261050
1424B8979A78987B97898C098921941315313698314281687BCB9469C489C6210205B050A1A7A4568A9BC5CB79AB647B74812 
0AB30BC1A131ACB120B0164CA1CABA121ABACA2B0BACAB184578678498958486764648456191654694745787545865490137
40201031012104270171216507457B854606C16BC523801365164130164BC7987409872CBA9A87A20B787AC9B7CBA834C0C1
3C341C1042010C14C01C414587854645A854C9503546495704A9756586B9B596958040872C3123B00CB316BC6C0B21B2C0C2C0
5301C0530CB1C1530C01CB0BC20CBC0CB1C87565756865A75A65A40898A898B91CA898A8B898A81BC8A9ACA989AB817A9BC1
BA9ABA9CA9AB918A1CACBAC09BCB0BC
last = ...........11   crc=745def277bifbedo
n = 13 #lyn = 630 #= 630 


Figure 20.5-F: Delta sequences for two different Gray codes for 13-bit Lyndon words. 


% ./bin 29 0 0   # 29 bits, output=progress, comparison function 0
n = 29   #lyn = 18512790
     1048576  ( 5.66406 % )   crc=ceabcbf2056be699
     2097152  ( 11.3281 % )   crc=76dd94f1a554b50d
     3145728  ( 16.9922 % )   crc=6b39957f1e141f4d
     4194304  ( 22.6563 % )   crc=53419af1f1185dc0
     5242880  ( 28.3203 % )   crc=45d45b193f8ee566
     6291456  ( 33.9844 % )   crc=95a24c824f56e196
     7340032  ( 39.6484 % )   crc=003eebafbb248e34
     8388608  ( 45.3125 % )   crc=23cb74d3ea0c4587
     9437184  ( 50.9766 % )   crc=896fd04c87dd0d43
    10485760  ( 56.6406 % )   crc=b00d8c899f0fc791
    11534336  ( 62.3047 % )   crc=d148f1b95b23eeab
    12582912  ( 67.9688 % )   crc=82971e2ed4863050
    13631488  ( 73.6328 % )   crc=f249adbb4fed252d
    14680064  ( 79.2969 % )   crc=909821d0c7246a98
    15728640  ( 84.9609 % )   crc=1cbd68e38ebbb3ca
    16777216  ( 90.625 % )    crc=0e64f82c67c79cf1
    17825792  ( 96.2891 % )   crc=62ci7b9f3c644396
last = ...........................11   crc=5736fc9365da927e
n = 29   #lyn = 18512790   #= 18512790


Figure 20.5-G: Computation of a Gray code through the 29-bit Lyndon words. Most output is suppressed,
only the CRC is printed at certain checkpoints.

For still greater values of n even the delta sequence tends to get huge (for example, with n = 37 the
sequence would be approximately 3.7 GB). One can suppress all output except for a progress indication,
as shown in figure 20.5-G. Here the CRC checksum is updated only with every (cyclically unadjusted)
$2^{16}$-th Lyndon word.


Sometimes a Gray code through the necklaces (except for the all-zeros and all-ones words) is also found
for composite n. Comparison functions 0, 1, and 2 lead to Gray codes (which are cycles) for all odd
$n \le 33$. Gray cycles are also found with comparison function 3, except for n = 21, 27, and 33. All
functions give Gray cycles also for n = 4 and n = 6. The values of n for which no Gray code was found
are the even values $\ge 8$.


20.5.3 No Gray codes for even $n \ge 8$


As the parity of the words in a Gray code sequence alternates between one and zero, the difference
between the numbers of words of odd and even weight must be zero or one. If it is one, no Gray cycle can
exist because the parity of the first and last word is identical.


We use the relations from section 18.3.2. For Lyndon words of odd length there are the same
number of words for odd and even weight by symmetry, so a Gray code (and also a Gray cycle) can exist.


For even length the sequences of numbers of Lyndon words of odd and even weights start as:

    n:     2,  4,  6,  8,  10,  12,  14,   16,   18,    20,    22,     24,      26,
    odd:   1,  2,  5, 16,  51, 170, 585, 2048, 7280, 26214, 95325, 349520, 1290555,
    even:  0,  1,  4, 14,  48, 165, 576, 2032, 7252, 26163, 95232, 349350, 1290240,
    diff:  1,  1,  1,  2,   3,   5,   9,   16,   28,    51,    93,    170,     315,

The last row gives the differences, entry A000048 in [312]. All entries for $n \ge 8$ are greater than one, so
no Gray code exists.

For the number of necklaces we have, for n = 2, 4, 6, ...

    n:     2,  4,  6,  8,  10,  12,  14,   16,   18,    20,    22,     24,      26,
    odd:   1,  2,  6, 16,  52, 172, 586, 2048, 7286, 26216, 95326, 349536, 1290556,
    even:  2,  4,  8, 20,  56, 180, 596, 2068, 7316, 26272, 95420, 349716, 1290872,
    diff:  1,  2,  2,  4,   4,   8,  10,   20,   30,    56,    94,    180,     316,

The (absolute) difference of both sequences is entry A000013 in [312]. We see that for $n \ge 4$ the numbers
are greater than one, so no Gray code exists.

If we exclude the all-ones and all-zeros words, then the differences are

    n:     2,  4,  6,  8,  10,  12,  14,   16,   18,    20,    22,     24,      26,
    diff:  1,  0,  0,  2,   2,   6,   8,   18,   28,    54,    92,    178,     314,

And again, no Gray code exists for $n \ge 8$. That is, we have found Gray codes, and even cycles, for all
computationally feasible sizes where they can exist.
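The first entries of the table for Lyndon words can be reproduced by brute force. The following C++ sketch is an illustration only (the function name lyndon_parity_counts() is hypothetical and not part of FXT): it tests every n-bit word for being strictly smaller than all of its nontrivial rotations and counts the Lyndon words by the parity of their weight. It assumes $2 \le n < 32$ and is practical only for small n.

```cpp
#include <cassert>
#include <utility>

// Count the n-bit binary Lyndon words of odd and of even weight by brute force.
// Returns the pair (odd, even). Intended for small n only (2 <= n < 32).
std::pair<unsigned long, unsigned long> lyndon_parity_counts(unsigned n)
{
    const unsigned long mask = (1UL << n) - 1;
    unsigned long odd = 0, even = 0;
    for (unsigned long w = 1; w < mask; ++w)  // all-zeros and all-ones are periodic
    {
        // w is a Lyndon word if it is strictly smaller than all its rotations
        bool lyndon = true;
        unsigned long x = w;
        for (unsigned s = 1; s < n; ++s)
        {
            x = ((x << 1) | (x >> (n - 1))) & mask;
            if ( x <= w )  { lyndon = false;  break; }
        }
        if ( !lyndon )  continue;

        unsigned wt = 0;  // weight: number of set bits
        for (x = w; x != 0; x &= x - 1)  ++wt;
        if ( wt & 1 )  ++odd;  else  ++even;
    }
    return std::make_pair(odd, even);
}
```

For n = 8 the function returns (16, 14), so the difference is 2 and no Gray code through the 8-bit Lyndon words can exist.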


Part III 


Fast transforms 
Chapter 21


The Fourier transform 


We introduce the discrete Fourier transform and give algorithms for its fast computation. Implementations
and optimization considerations for complex and real-valued transforms are given. The fast Fourier
transforms are the basis of the algorithms for fast convolution described in chapter 22. These are in turn
the core of the fast high precision multiplication routines treated in a later chapter. The number theoretic
transforms are also treated in a later chapter. Algorithms for Fourier transforms based on fast convolution like
Bluestein's algorithm and Rader's algorithm are given in chapter 22.


21.1 The discrete Fourier transform 


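The discrete Fourier transform (DFT) of a complex sequence $a = [a_0, a_1, \ldots, a_{n-1}]$ of length $n$ is the
complex sequence $c = [c_0, c_1, \ldots, c_{n-1}]$ defined by

$$ c = \mathcal{F}[a] \tag{21.1-1a} $$
$$ c_k := \frac{1}{\sqrt{n}} \sum_{x=0}^{n-1} a_x\, z^{+x k} \quad\text{where}\quad z = e^{2\pi i/n} \tag{21.1-1b} $$

$z$ is a primitive $n$-th root of unity: $z^n = 1$ and $z^j \neq 1$ for $0 < j < n$.


The inverse discrete Fourier transform is

$$ a = \mathcal{F}^{-1}[c] \tag{21.1-2a} $$
$$ a_x := \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} c_k\, z^{-x k} \tag{21.1-2b} $$

To see this, consider the element $y$ of the inverse transform of the transform of $a$:

$$ \mathcal{F}^{-1}\big[\mathcal{F}[a]\big]_y = \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} \frac{1}{\sqrt{n}} \sum_{x=0}^{n-1} \left(a_x\, z^{x k}\right) z^{-y k} \tag{21.1-3a} $$
$$ = \frac{1}{n} \sum_{x} a_x \sum_{k} \left(z^{x-y}\right)^k \tag{21.1-3b} $$

Now $\sum_k (z^{x-y})^k = n$ for $x = y$ and $0$ else. This is because $z$ is an $n$-th primitive root of unity: with
$x = y$ the sum consists of $n$ times $z^0 = 1$, with $x \neq y$ the summands lie on the unit circle (on the vertices
of an equilateral polygon with center 0) and add up to 0. Therefore the whole expression is equal to

$$ \frac{1}{n}\, n \sum_{x} a_x\, \delta_{x,y} = a_y \quad\text{where}\quad \delta_{x,y} := \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases} \tag{21.1-4} $$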


Here we will call the transform with the plus in the exponent the forward transform. The choice is 
actually arbitrary, engineers seem to prefer the minus for the forward transform, mathematicians the 
plus. The sign in the exponent is called the sign of the transform. 


The Fourier transform is linear: for $\alpha, \beta \in \mathbb{C}$ we have

$$ \mathcal{F}[\alpha\, a + \beta\, b] = \alpha\, \mathcal{F}[a] + \beta\, \mathcal{F}[b] \tag{21.1-5} $$

Further Parseval's equation holds, the sum of squares of the absolute values is identical for a sequence
and its Fourier transform:

$$ \sum_{x=0}^{n-1} |a_x|^2 = \sum_{k=0}^{n-1} |c_k|^2 \tag{21.1-6} $$


A straightforward implementation of the discrete Fourier transform, that is, the computation of n sums
each of length n, requires $O(n^2)$ operations [FXT: fft/slowft.cc]:

    void slow_ft(Complex *f, long n, int is)
    {
        Complex h[n];
        const double ph0 = is*2.0*M_PI/n;
        for (long w=0; w<n; ++w)
        {
            Complex t = 0.0;
            for (long k=0; k<n; ++k)
            {
                t += f[k] * SinCos(ph0*k*w);
            }
            h[w] = t;
        }
        acopy(h, f, n);
    }

The variable is = $\sigma = \pm 1$ is the sign of the transform, the function SinCos(x) returns the complex
number cos(x) + i sin(x). Note that the normalization factor $1/\sqrt{n}$ in front of the sums has been left
out. The inverse of the transform with sign $\sigma$ is the transform with sign $-\sigma$ followed by a multiplication
of each element by $1/n$. Without the normalization, the sum of squares of the absolute values of the
transform equals $n$ times that of the original sequence.
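The routine above depends on FXT types and helpers. As an illustration, here is a self-contained variant using only the C++ standard library; it is a sketch, not the FXT code. The name slow_dft() is hypothetical, std::polar() takes the role of SinCos(), and the result is returned in a new vector instead of overwriting the input.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef std::complex<double> Complex;

// Unnormalized DFT by direct summation, O(n^2) operations.
// is = +1 or -1 is the sign of the transform.
std::vector<Complex> slow_dft(const std::vector<Complex> &f, int is)
{
    const long n = (long)f.size();
    const double pi = std::acos(-1.0);
    const double ph0 = is * 2.0 * pi / (double)n;
    std::vector<Complex> h(n);
    for (long w = 0; w < n; ++w)
    {
        Complex t = 0.0;
        for (long k = 0; k < n; ++k)
        {
            // k*w is reduced modulo n to keep the argument small
            t += f[k] * std::polar(1.0, ph0 * (double)((k * w) % n));
        }
        h[w] = t;
    }
    return h;
}
```

Applying slow_dft() with sign +1 and then with sign -1 returns the original sequence multiplied by n, in agreement with the remark on normalization.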


A fast Fourier transform (FFT) algorithm has complexity O(n log(n)). There are several different FFT 
algorithms with many variants. 


21.2 Radix-2 FFT algorithms 


We fix some notation. In what follows let $a$ be a length-$n$ sequence with $n$ a power of 2.

• Let $a^{(even)}$ and $a^{(odd)}$ denote the length-$n/2$ subsequences of those elements of $a$ that have even and
odd indices, respectively. That is, $a^{(even)} = [a_0, a_2, a_4, a_6, \ldots, a_{n-2}]$ and $a^{(odd)} = [a_1, a_3, \ldots, a_{n-1}]$.

• Let $a^{(left)}$ and $a^{(right)}$ denote the left and right subsequences, respectively. That is,
$a^{(left)} = [a_0, a_1, \ldots, a_{n/2-1}]$ and $a^{(right)} = [a_{n/2}, a_{n/2+1}, \ldots, a_{n-1}]$.

• Let $c = \mathcal{S}^k a$ denote the sequence with elements $c_x = a_x\, e^{\sigma\, 2\pi i\, k\, x/n}$ where $\sigma = \pm 1$ is the sign of the
transform. The symbol $\mathcal{S}$ shall suggest a shift operator. With radix-2 FFT algorithms only $\mathcal{S}^{1/2}$ is
needed. Note that the operator $\mathcal{S}$ depends on the sign of the transform.

• In relations between sequences we sometimes emphasize the length of the sequences on both sides
as in $a^{(even)} \stackrel{n/2}{=} b^{(odd)} + c^{(odd)}$. In these relations the operators $+$ and $-$ are element-wise.


21.2.1 Decimation in time (DIT) FFT


The following observation is the key to the (radix-2) decimation in time (DIT) FFT algorithm, also called
the Cooley-Tukey FFT algorithm: For even values of $n$ the $k$-th element of the Fourier transform is

$$ \mathcal{F}[a]_k = \sum_{x=0}^{n-1} a_x\, z^{x k} = \sum_{x=0}^{n/2-1} a_{2x}\, z^{2 x k} + \sum_{x=0}^{n/2-1} a_{2x+1}\, z^{(2x+1) k} \tag{21.2-1a} $$
$$ = \sum_{x=0}^{n/2-1} a_{2x}\, z^{2 x k} + z^k \sum_{x=0}^{n/2-1} a_{2x+1}\, z^{2 x k} \tag{21.2-1b} $$

where $z = e^{\sigma\, 2\pi i/n}$, $\sigma = \pm 1$ is the sign of the transform, and $k \in \{0, 1, \ldots, n-1\}$.

The identity tells us how to compute the $k$-th element of the length-$n$ Fourier transform from the
length-$n/2$ Fourier transforms of the even and odd indexed subsequences.

To rewrite the length-$n$ transform in terms of length-$n/2$ transforms, we have to distinguish whether
$0 \le k < n/2$ or $n/2 \le k < n$. In the expressions we rewrite $k \in \{0, 1, 2, \ldots, n-1\}$ as $k = j + \delta\, n/2$ where
$j \in \{0, 1, 2, \ldots, n/2-1\}$ and $\delta \in \{0, 1\}$:

$$ \sum_{x=0}^{n-1} a_x\, z^{x (j+\delta n/2)} = \sum_{x=0}^{n/2-1} a^{(even)}_x\, z^{2x (j+\delta n/2)} + z^{j+\delta n/2} \sum_{x=0}^{n/2-1} a^{(odd)}_x\, z^{2x (j+\delta n/2)} \tag{21.2-2a} $$
$$ = \begin{cases} \displaystyle \sum_{x=0}^{n/2-1} a^{(even)}_x\, z^{2 x j} + z^j \sum_{x=0}^{n/2-1} a^{(odd)}_x\, z^{2 x j} & \text{for } \delta = 0 \\[2ex] \displaystyle \sum_{x=0}^{n/2-1} a^{(even)}_x\, z^{2 x j} - z^j \sum_{x=0}^{n/2-1} a^{(odd)}_x\, z^{2 x j} & \text{for } \delta = 1 \end{cases} \tag{21.2-2b} $$

The minus sign in the relation for $\delta = 1$ is due to the equality $z^{j + 1 \cdot n/2} = z^j\, z^{n/2} = -z^j$.

Observing that $z^2$ is just the root of unity that appears in a length-$n/2$ transform we can rewrite the last
two equations to obtain the radix-2 DIT FFT step:

$$ \mathcal{F}[a]^{(left)} \stackrel{n/2}{=} \mathcal{F}[a^{(even)}] + \mathcal{S}^{1/2}\, \mathcal{F}[a^{(odd)}] \tag{21.2-3a} $$
$$ \mathcal{F}[a]^{(right)} \stackrel{n/2}{=} \mathcal{F}[a^{(even)}] - \mathcal{S}^{1/2}\, \mathcal{F}[a^{(odd)}] \tag{21.2-3b} $$

The length-$n$ transform has been replaced by two transforms of length $n/2$. If $n$ is a power of 2, this
scheme can be applied recursively until length-one transforms are reached which are identity ('do nothing')
operations.

The complexity is $O(n \log_2(n))$: there are $\log_2(n)$ splitting steps, the work in each step is $O(n)$.

21.2.1.1 Recursive implementation 


A recursive implementation of radix-2 DIT FFT given as pseudocode (C++ version in [FXT:
fft/recfft2.cc]) is

    procedure rec_fft_dit2(a[], n, x[], is)
    // complex a[0..n-1] input
    // complex x[0..n-1] result
    {
        complex b[0..n/2-1], c[0..n/2-1] // workspace
        complex s[0..n/2-1], t[0..n/2-1] // workspace

        if n == 1 then // end of recursion
        {
            x[0] := a[0]
            return
        }

        nh := n/2

        for k:=0 to nh-1 // copy to workspace
        {
            s[k] := a[2*k]   // even indexed elements
            t[k] := a[2*k+1] // odd indexed elements
        }

        // recursion: call two half-length FFTs:
        rec_fft_dit2(s[], nh, b[], is)
        rec_fft_dit2(t[], nh, c[], is)

        fourier_shift(c[], nh, is*1/2)

        for k:=0 to nh-1 // copy back from workspace
        {
            x[k]    := b[k] + c[k]
            x[k+nh] := b[k] - c[k]
        }
    }

The parameter is = $\sigma = \pm 1$ is the sign of the transform. The data length n must be a power of 2. The
result is returned in the array x[]. Note that normalization (multiplication of each element of x[] by
$1/\sqrt{n}$) is not included here.


The procedure uses the subroutine fourier_shift() which modifies the array c[] according to the
operation $\mathcal{S}^v$: each element c[k] is multiplied by $e^{v\, 2\pi i\, k/n}$. It is called with $v = \pm 1/2$ for the Fourier
transform. The pseudocode (C++ equivalent in [FXT: fft/fouriershift.cc]) is

    procedure fourier_shift(c[], n, v)
    {
        for k:=0 to n-1
        {
            c[k] := c[k] * exp(v*2.0*PI*I*k/n)
        }
    }

The recursive FFT-procedure involves O(n) function calls to itself, these can be avoided by rewriting it
in an iterative way. We can even do all operations in-place, no temporary workspace is needed at all. The
price is the necessity of an additional data reordering: the procedure revbin_permute(a[],n) rearranges
the array a[] in a way that each element $a_x$ is swapped with $a_{\tilde{x}}$, where $\tilde{x}$ is obtained from $x$ by reversing
its binary digits. Methods for doing this are discussed in section 2.6 on page 118.


21.2.1.2 Iterative implementation 


A non-recursive procedure for the radix-2 DIT FFT is (C++ version in [FXT: fft/fftdit2.cc]):

    procedure fft_depth_first_dit2(a[], ldn, is)
    // complex a[0..2**ldn-1] input, result
    {
        n := 2**ldn // length of a[] is a power of 2

        revbin_permute(a[], n)

        for ldm:=1 to ldn // log_2(n) iterations
        {
            m := 2**ldm
            mh := m/2

            for r:=0 to n-m step m // n/m iterations
            {
                for j:=0 to mh-1 // m/2 iterations
                {
                    e := exp(is*2*PI*I*j/m) // log_2(n)*n/m*m/2 = log_2(n)*n/2 computations

                    u := a[r+j]
                    v := a[r+j+mh] * e

                    a[r+j]    := u + v
                    a[r+j+mh] := u - v
                }
            }
        }
    }

This version of a non-recursive FFT procedure already avoids the calling overhead and it works in-place.
But it is a bit wasteful. The (expensive) computation e := exp(is*2*PI*I*j/m) is done $n/2 \cdot \log_2(n)$
times.


21.2.1.3 Saving trigonometric computations 


To reduce the number of sine and cosine computations, we can swap the two inner loops, leading to the
first 'real world' FFT procedure presented here. A non-recursive procedure for the radix-2 DIT FFT is
(C++ version in [FXT: fft/fftdit2.cc]):

    procedure fft_dit2(a[], ldn, is)
    // complex a[0..2**ldn-1] input, result
    {
        n := 2**ldn
        revbin_permute(a[], n)
        for ldm:=1 to ldn // log_2(n) iterations
        {
            m := 2**ldm
            mh := m/2
            for j:=0 to mh-1 // m/2 iterations
            {
                e := exp(is*2*PI*I*j/m) // 1 + 2 + ... + n/8 + n/4 + n/2 == n-1 computations
                for r:=0 to n-m step m
                {
                    u := a[r+j]
                    v := a[r+j+mh] * e
                    a[r+j]    := u + v
                    a[r+j+mh] := u - v
                }
            }
        }
    }


Swapping the two inner loops reduces the number of trigonometric computations to n but leads to a 
feature that many FFT implementations share: memory access is highly non-local. For each recursion 
stage (value of 1dm) the array is traversed mh times with n/m accesses in strides of mh. This memory 
access pattern can have a very negative performance impact for large n. If memory access is very slow 
compared to the CPU, the naive version can actually be faster. 
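For readers who want to run the procedure, here is a direct C++ transcription of the pseudocode for fft_dit2, given as a sketch rather than as the FXT implementation. It uses std::complex and a simple bit-by-bit revbin_permute(); both are assumptions made for brevity, the FXT versions are more refined.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <utility>
#include <vector>

typedef std::complex<double> Complex;

// Swap a[x] with a[r] where r is x with its ldn binary digits reversed.
void revbin_permute(std::vector<Complex> &a, unsigned ldn)
{
    const unsigned long n = 1UL << ldn;
    for (unsigned long x = 0; x < n; ++x)
    {
        unsigned long r = 0;
        for (unsigned b = 0; b < ldn; ++b)
            if ( x & (1UL << b) )  r |= 1UL << (ldn - 1 - b);
        if ( r > x )  std::swap(a[x], a[r]);  // swap each pair only once
    }
}

// In-place radix-2 DIT FFT of a[0..2**ldn-1], no normalization.
// is = +1 or -1 is the sign of the transform.
void fft_dit2(std::vector<Complex> &a, unsigned ldn, int is)
{
    const unsigned long n = 1UL << ldn;
    const double pi = std::acos(-1.0);
    revbin_permute(a, ldn);
    for (unsigned ldm = 1; ldm <= ldn; ++ldm)
    {
        const unsigned long m = 1UL << ldm;
        const unsigned long mh = m / 2;
        for (unsigned long j = 0; j < mh; ++j)
        {
            // one trigonometric computation per value of j
            const Complex e = std::polar(1.0, is * 2.0 * pi * (double)j / (double)m);
            for (unsigned long r = 0; r + m <= n; r += m)
            {
                const Complex u = a[r + j];
                const Complex v = a[r + j + mh] * e;
                a[r + j]      = u + v;
                a[r + j + mh] = u - v;
            }
        }
    }
}
```

As a check, the transform (sign +1) of the length-8 sequence with a single one at index 1 has the elements $e^{2\pi i\, k/8}$ for $k = 0, \ldots, 7$.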

It is a good idea to extract the ldm==1 stage of the outermost loop. This avoids complex multiplications
with the trivial factors $1 + 0\, i$ and the computations of these quantities as trigonometric functions. Replace
the line

    for ldm:=1 to ldn

by the lines

    for r:=0 to n-1 step 2
    {
        {a[r], a[r+1]} := {a[r]+a[r+1], a[r]-a[r+1]} // parallel assignment
    }
    for ldm:=2 to ldn

The parallel assignment would translate into the following C-code:

    Complex tmp1 = a[r] + a[r+1],  tmp2 = a[r] - a[r+1];
    a[r] = tmp1;
    a[r+1] = tmp2;

21.2.2 Decimation in frequency (DIF) FFT 


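By splitting the Fourier sum into a left and right half we obtain the decimation in frequency (DIF) FFT
algorithm, also called Sande-Tukey FFT algorithm. For even values of $n$ the $k$-th element of the Fourier
transform is

$$ \mathcal{F}[a]_k = \sum_{x=0}^{n-1} a_x\, z^{x k} = \sum_{x=0}^{n/2-1} a_x\, z^{x k} + \sum_{x=n/2}^{n-1} a_x\, z^{x k} \tag{21.2-4a} $$
$$ = \sum_{x=0}^{n/2-1} a_x\, z^{x k} + \sum_{x=0}^{n/2-1} a_{x+n/2}\, z^{(x+n/2) k} \tag{21.2-4b} $$
$$ = \sum_{x=0}^{n/2-1} \left( a^{(left)}_x + z^{k\, n/2}\, a^{(right)}_x \right) z^{x k} \tag{21.2-4c} $$

where $z = e^{\sigma\, 2\pi i/n}$, $\sigma = \pm 1$ is the sign of the transform, and $k \in \{0, 1, \ldots, n-1\}$.

Here one has to distinguish whether $k$ is even or odd. Therefore we rewrite $k \in \{0, 1, 2, \ldots, n-1\}$ as
$k = 2j + \delta$ where $j \in \{0, 1, 2, \ldots, n/2-1\}$ and $\delta \in \{0, 1\}$:

$$ \sum_{x=0}^{n-1} a_x\, z^{x (2j+\delta)} = \sum_{x=0}^{n/2-1} \left( a^{(left)}_x + z^{(2j+\delta)\, n/2}\, a^{(right)}_x \right) z^{x (2j+\delta)} \tag{21.2-5a} $$
$$ = \begin{cases} \displaystyle \sum_{x=0}^{n/2-1} \left( a^{(left)}_x + a^{(right)}_x \right) z^{2 x j} & \text{for } \delta = 0 \\[2ex] \displaystyle \sum_{x=0}^{n/2-1} z^x \left( a^{(left)}_x - a^{(right)}_x \right) z^{2 x j} & \text{for } \delta = 1 \end{cases} \tag{21.2-5b} $$

Now $z^{(2j+\delta)\, n/2} = e^{\pm \pi i\, \delta}$ equals $+1$ for $\delta = 0$ (even $k$) and $-1$ for $\delta = 1$ (odd $k$). The last two equations,
more compactly written, are the radix-2 DIF FFT step:

$$ \mathcal{F}[a]^{(even)} \stackrel{n/2}{=} \mathcal{F}\left[ a^{(left)} + a^{(right)} \right] \tag{21.2-6a} $$
$$ \mathcal{F}[a]^{(odd)} \stackrel{n/2}{=} \mathcal{F}\left[ \mathcal{S}^{1/2} \left( a^{(left)} - a^{(right)} \right) \right] \tag{21.2-6b} $$
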

A recursive implementation of radix-2 DIF FFT (C++ version given in [FXT: fft/recfft2.cc]) is

    procedure rec_fft_dif2(a[], n, x[], is)
    // complex a[0..n-1] input
    // complex x[0..n-1] result
    {
        complex b[0..n/2-1], c[0..n/2-1] // workspace
        complex s[0..n/2-1], t[0..n/2-1] // workspace

        if n == 1 then
        {
            x[0] := a[0]
            return
        }

        nh := n/2

        for k:=0 to nh-1
        {
            s[k] := a[k]    // 'left' elements
            t[k] := a[k+nh] // 'right' elements
        }

        for k:=0 to nh-1
        {
            {s[k], t[k]} := {s[k]+t[k], s[k]-t[k]} // parallel assignment
        }

        fourier_shift(t[], nh, is*0.5)

        rec_fft_dif2(s[], nh, b[], is)
        rec_fft_dif2(t[], nh, c[], is)

        j := 0
        for k:=0 to nh-1
        {
            x[j]   := b[k]
            x[j+1] := c[k]
            j := j+2
        }
    }

The parameter is = $\sigma = \pm 1$ is the sign of the transform. The data length n must be a power of 2. The
result is returned in the array x[]. Again, the routine does no normalization.

A non-recursive version is (the C++ equivalent is given in [FXT: fft/fftdif2.cc]):

    procedure fft_dif2(a[], ldn, is)
    // complex a[0..2**ldn-1] input, result
    {
        n := 2**ldn
        for ldm:=ldn to 1 step -1
        {
            m := 2**ldm
            mh := m/2
            for j:=0 to mh-1
            {
                e := exp(is*2*PI*I*j/m)

                for r:=0 to n-m step m
                {
                    u := a[r+j]
                    v := a[r+j+mh]
                    a[r+j]    := (u + v)
                    a[r+j+mh] := (u - v) * e
                }
            }
        }

        revbin_permute(a[], n)
    }

In DIF FFTs the procedure revbin_permute() is called after the main loop, in the DIT code it is called
before the main loop. As in the procedure for the DIT FFT (section 21.2.1.3 on page 414) the inner loops
were swapped to save trigonometric computations.

Extracting the ldm==1 stage of the outermost loop is again a good idea. Replace the line

    for ldm:=ldn to 1 step -1

by

    for ldm:=ldn to 2 step -1

and insert

    for r:=0 to n-1 step 2
    {
        {a[r], a[r+1]} := {a[r]+a[r+1], a[r]-a[r+1]} // parallel assignment
    }

before the call of revbin_permute(a[], n).


21.3 Saving trigonometric computations 


The sine and cosine computations are an expensive part of any FFT. There are two apparent ways for
saving CPU cycles, the use of lookup-tables and recursive methods. The CORDIC algorithms for sine
and cosine given in section 33.2.1 on page 646 can be useful when implementing FFTs in hardware.

21.3.1 Using lookup tables 


We can precompute and store all necessary values, and later look them up when needed. This is a good 
idea when computing many FFTs of the same (small) length. For FFTs of long sequences one needs large 
lookup tables that can introduce a high cache-miss rate. So we may experience little or no speed gain,
even a notable slowdown is possible. 


However, for a length-n FFT we do not need to store all the (n complex or 2n real) sine/cosine values $\exp(2\pi i\,k/n) = \cos(2\pi k/n) + i \sin(2\pi k/n)$ where $k = 0,1,2,3,\ldots,n-1$. The following symmetry relations reduce the interval from $0\ldots 2\pi$ to $0\ldots \pi$:

    \cos(\pi + x) = -\cos(x)    (21.3-1a)
    \sin(\pi + x) = -\sin(x)    (21.3-1b)

The next relations further reduce the interval to $0\ldots \pi/2$:

    \cos(\pi/2 + x) = -\sin(x)    (21.3-2a)
    \sin(\pi/2 + x) = +\cos(x)    (21.3-2b)

Finally, only the table of cosines is needed:

    \sin(x) = \cos(\pi/2 - x)    (21.3-3)

That is, already a table of the n/4 real values $\cos(2\pi k/n)$ for $k = 0,1,2,3,\ldots,n/4-1$ suffices for a length-n FFT computation. The size of the table is thereby cut by a factor of 8. Possible cache problems can sometimes be mitigated by simply storing the trigonometric values in reversed order, as this avoids many equidistant memory accesses.
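
As an illustration of the symmetry relations, the following sketch stores only the n/4 cosine values and derives every other sine and cosine from them. The struct QuarterCosTable and its member names are hypothetical and chosen for this example only; the value cos(pi/2) = 0, which lies just outside the table, is handled as a special case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Table of cos(2*pi*k/n) for k = 0 .. n/4-1; n must be a multiple of 4.
struct QuarterCosTable
{
    std::size_t n;
    std::vector<double> tab;

    explicit QuarterCosTable(std::size_t len) : n(len), tab(len / 4)
    {
        assert( n >= 4 && n % 4 == 0 );
        const double pi = std::acos(-1.0);
        for (std::size_t k = 0; k < n / 4; ++k)
            tab[k] = std::cos(2.0 * pi * double(k) / double(n));
    }

    // cos(2*pi*k/n) for any k
    double cos_k(std::size_t k) const
    {
        const std::size_t q = n / 4;
        k %= n;
        if ( k >= n / 2 )  return -cos_k(k - n / 2);  // cos(pi+x) = -cos(x)
        if ( k > q )   return -cos_k(n / 2 - k);      // cos(pi/2+y) = -cos(pi/2-y)
        if ( k == q )  return 0.0;                    // cos(pi/2), not in the table
        return tab[k];
    }

    // sin(2*pi*k/n) = cos(2*pi*k/n - pi/2)
    double sin_k(std::size_t k) const
    {
        return cos_k((k % n) + n - n / 4);
    }
};
```

For n = 16 the table holds just four values, yet cos_k() and sin_k() cover all sixteen angles.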


21.3.2 Recursive generation 


We write $E(x)$ for $\exp(i x) = \cos(x) + i \sin(x)$. In FFT computations one typically needs the values

    e_0 = E(\varphi),  e_1 = E(\varphi + 1\gamma),  e_2 = E(\varphi + 2\gamma),  e_3 = E(\varphi + 3\gamma),  \ldots,  e_k = E(\varphi + k\gamma),  \ldots

in sequence. We could precompute $g = E(\gamma)$ and $e_0 = E(\varphi)$, and compute the values successively as

    e_k = g \cdot e_{k-1}    (21.3-4)

However, the numerical error grows exponentially, rendering the method useless (same for the recursions 35.2-10a and 35.2-10b on page 679). A stable version of a trigonometric recursion for the computation of the sequence can be stated as follows. Precompute

    c_0 = \cos\varphi,    (21.3-5a)
    s_0 = \sin\varphi,    (21.3-5b)
    \alpha = 1 - \cos\gamma    [Cancellation!]    (21.3-5c)
           = 2 (\sin(\gamma/2))^2    [OK]    (21.3-5d)
    \beta = \sin\gamma    (21.3-5e)

Then compute the next pair $(c_{k+1}, s_{k+1})$ from $(c_k, s_k)$ via

    c_{k+1} = c_k - (\alpha c_k + \beta s_k);    (21.3-6a)
    s_{k+1} = s_k - (\alpha s_k - \beta c_k);    (21.3-6b)

Here we use the relation $E(\varphi + \gamma) = E(\varphi) - E(\varphi) \cdot z$, this leads to $z = 1 - \cos\gamma - i \sin\gamma = 2 (\sin(\gamma/2))^2 - i \sin\gamma$.


A certain loss of precision still has to be expected, but even for very long FFTs less than 3 bits of precision are lost. When working with the C-type double it might be a good idea to use the type long double with the trigonometric recursion: the generated values will then always be accurate within the precision of the type double, provided long doubles are actually more precise than doubles. With exact integer convolution this can be mandatory.


We give an example from [FXT: fht/fhtdif.cc], the variable tt is $\gamma$ in relations 21.3-5d and 21.3-5e:

    [--snip--]
    double tt = M_PI_4/kh;  // the angle increment
    double s1 = 0.0, c1 = 1.0;  // start at angle zero
    double al = sin(0.5*tt);
    al *= (2.0*al);
    double be = sin(tt);

    for (ulong i=1; i<kh; i++)
    {
        double t1 = c1;
        c1 -= (al*t1+be*s1);
        s1 -= (al*s1-be*t1);

        // here c1 = cos(tt*i) and s1 = sin(tt*i)
    [--snip--]
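
The recursion of relations 21.3-5 and 21.3-6 can also be written as a small stand-alone routine. The sketch below is not taken from FXT; the function name trig_sequence is an assumption made for this example, and long double is used for the running values as suggested above (which helps only where long double is more precise than double).

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Return the pairs (cos(phi + k*gamma), sin(phi + k*gamma)) for k = 0 .. count-1,
// generated with the stable recursion.
std::vector<std::pair<double, double>>
trig_sequence(double phi, double gamma, std::size_t count)
{
    std::vector<std::pair<double, double>> out;
    out.reserve(count);

    long double c = std::cos((long double)phi);  // c_0
    long double s = std::sin((long double)phi);  // s_0
    long double al = std::sin(0.5L * gamma);
    al *= 2.0L * al;                             // alpha = 2*sin(gamma/2)^2, no cancellation
    const long double be = std::sin((long double)gamma);  // beta

    for (std::size_t k = 0; k < count; ++k)
    {
        out.emplace_back((double)c, (double)s);
        const long double t = c;
        c -= (al * t + be * s);  // relation 21.3-6a
        s -= (al * s - be * t);  // relation 21.3-6b
    }
    return out;
}
```

A call such as trig_sequence(0.0, 2*pi/n, n/4) would give the first quarter of the values needed for a length-n transform with only three calls to the library sine and cosine.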


21.4 Higher radix FFT algorithms 


Higher radix FFT algorithms save trigonometric computations. The radix-4 FFT algorithms presented in what follows replace all multiplications with complex factors $(\pm 1, \pm i)$ by the obvious simpler operations. Radix-8 algorithms also simplify the special cases where the sines and cosines equal $\pm\sqrt{1/2}$.

The bookkeeping overhead is also reduced, due to the more unrolled structure. Moreover, the number of loads and stores is reduced.


We fix more notation. Let a be a length-n sequence where n is a multiple of m.

• Let $a^{(r\%m)}$ denote the subsequence of the elements with index x where $x \equiv r \bmod m$. For example, $a^{(0\%2)} = a^{(even)}$ and $a^{(3\%4)} = [a_3, a_7, a_{11}, a_{15}, \ldots]$. The length of $a^{(r\%m)}$ is n/m.

• Let $a^{(r/m)}$ denote the subsequence obtained by splitting a into m parts of length n/m: $a = [a^{(0/m)}, a^{(1/m)}, \ldots, a^{((m-1)/m)}]$. For example $a^{(1/2)} = a^{(right)}$ and $a^{(2/3)}$ is the last third of a.
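
To make the two kinds of subsequences concrete, here is a small sketch that extracts them from a std::vector. The function names subseq_mod and subseq_part are invented for this illustration; they do not appear in FXT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// a^(r%m): the elements a[x] with x = r mod m (stride-m subsequence).
template <typename T>
std::vector<T> subseq_mod(const std::vector<T> &a, std::size_t r, std::size_t m)
{
    assert( m > 0 && r < m );
    std::vector<T> s;
    for (std::size_t x = r; x < a.size(); x += m)  s.push_back(a[x]);
    return s;
}

// a^(r/m): part number r when a is split into m contiguous parts of length n/m.
template <typename T>
std::vector<T> subseq_part(const std::vector<T> &a, std::size_t r, std::size_t m)
{
    assert( m > 0 && r < m && a.size() % m == 0 );
    const std::size_t len = a.size() / m;
    return std::vector<T>(a.begin() + r * len, a.begin() + (r + 1) * len);
}
```

With a = [0, 1, ..., 11], subseq_mod(a, 3, 4) gives [3, 7, 11] and subseq_part(a, 2, 3) gives the last third [8, 9, 10, 11].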


21.4.1 Decimation in time algorithms 

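We rewrite the radix-2 DIT step (relations 21.2-3a and 21.2-3b on page 412) in the new notation:

    \mathcal{F}[a]^{(0/2)} \stackrel{n/2}{=} \mathcal{S}^{0/2} \mathcal{F}[a^{(0\%2)}] + \mathcal{S}^{1/2} \mathcal{F}[a^{(1\%2)}]    (21.4-1a)
    \mathcal{F}[a]^{(1/2)} \stackrel{n/2}{=} \mathcal{S}^{0/2} \mathcal{F}[a^{(0\%2)}] - \mathcal{S}^{1/2} \mathcal{F}[a^{(1\%2)}]    (21.4-1b)

The operator $\mathcal{S}$ is defined in section 21.2 on page 411, note that $\mathcal{S}^{0/2} = \mathcal{S}^0$ is the identity operator.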


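The derivation of the radix-4 step is analogous to the radix-2 step, it just involves more writing and does not give additional insights. So we just state the radix-4 DIT FFT step which can be applied when n is divisible by 4:

    \mathcal{F}[a]^{(0/4)} \stackrel{n/4}{=} +\mathcal{S}^{0/4}\mathcal{F}[a^{(0\%4)}] + \mathcal{S}^{1/4}\mathcal{F}[a^{(1\%4)}] + \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}] + \mathcal{S}^{3/4}\mathcal{F}[a^{(3\%4)}]    (21.4-2a)
    \mathcal{F}[a]^{(1/4)} \stackrel{n/4}{=} +\mathcal{S}^{0/4}\mathcal{F}[a^{(0\%4)}] + i\sigma\,\mathcal{S}^{1/4}\mathcal{F}[a^{(1\%4)}] - \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}] - i\sigma\,\mathcal{S}^{3/4}\mathcal{F}[a^{(3\%4)}]    (21.4-2b)
    \mathcal{F}[a]^{(2/4)} \stackrel{n/4}{=} +\mathcal{S}^{0/4}\mathcal{F}[a^{(0\%4)}] - \mathcal{S}^{1/4}\mathcal{F}[a^{(1\%4)}] + \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}] - \mathcal{S}^{3/4}\mathcal{F}[a^{(3\%4)}]    (21.4-2c)
    \mathcal{F}[a]^{(3/4)} \stackrel{n/4}{=} +\mathcal{S}^{0/4}\mathcal{F}[a^{(0\%4)}] - i\sigma\,\mathcal{S}^{1/4}\mathcal{F}[a^{(1\%4)}] - \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}] + i\sigma\,\mathcal{S}^{3/4}\mathcal{F}[a^{(3\%4)}]    (21.4-2d)

The relations can be written more compactly as

    \mathcal{F}[a]^{(j/4)} \stackrel{n/4}{=} +e^{\sigma 2\pi i\,0j/4} \cdot \mathcal{S}^{0/4}\mathcal{F}[a^{(0\%4)}] + e^{\sigma 2\pi i\,1j/4} \cdot \mathcal{S}^{1/4}\mathcal{F}[a^{(1\%4)}]
                                             + e^{\sigma 2\pi i\,2j/4} \cdot \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}] + e^{\sigma 2\pi i\,3j/4} \cdot \mathcal{S}^{3/4}\mathcal{F}[a^{(3\%4)}]    (21.4-3)

where $j \in \{0,1,2,3\}$ and n is a multiple of 4. An even more compact form is

    \mathcal{F}[a]^{(j/4)} \stackrel{n/4}{=} \sum_{k=0}^{3} e^{\sigma 2\pi i\,kj/4} \cdot \mathcal{S}^{k/4}\mathcal{F}[a^{(k\%4)}]    j \in \{0,1,2,3\}    (21.4-4)

where the summation symbol denotes element-wise summation of the sequences. The dot indicates multiplication of all elements of the sequence by the exponential.

The general radix-r DIT FFT step, applicable when n is a multiple of r, is:

    \mathcal{F}[a]^{(j/r)} \stackrel{n/r}{=} \sum_{k=0}^{r-1} e^{\sigma 2\pi i\,kj/r} \cdot \mathcal{S}^{k/r}\mathcal{F}[a^{(k\%r)}]    j = 0,1,2,\ldots,r-1    (21.4-5)

Our notation turned out to be useful indeed.

21.4.2 Decimation in frequency algorithms

The radix-2 DIF step (relations 21.2-6a and 21.2-6b on page 415), in the new notation, is

    \mathcal{F}[a]^{(0\%2)} \stackrel{n/2}{=} \mathcal{F}[\mathcal{S}^{0/2}(a^{(0/2)} + a^{(1/2)})]    (21.4-6a)
    \mathcal{F}[a]^{(1\%2)} \stackrel{n/2}{=} \mathcal{F}[\mathcal{S}^{1/2}(a^{(0/2)} - a^{(1/2)})]    (21.4-6b)

The radix-4 DIF FFT step, applicable for n divisible by 4, is

    \mathcal{F}[a]^{(0\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{0/4}(a^{(0/4)} + a^{(1/4)} + a^{(2/4)} + a^{(3/4)})]    (21.4-7a)
    \mathcal{F}[a]^{(1\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{1/4}(a^{(0/4)} + i\sigma\,a^{(1/4)} - a^{(2/4)} - i\sigma\,a^{(3/4)})]    (21.4-7b)
    \mathcal{F}[a]^{(2\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{2/4}(a^{(0/4)} - a^{(1/4)} + a^{(2/4)} - a^{(3/4)})]    (21.4-7c)
    \mathcal{F}[a]^{(3\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{3/4}(a^{(0/4)} - i\sigma\,a^{(1/4)} - a^{(2/4)} + i\sigma\,a^{(3/4)})]    (21.4-7d)

Again, $\sigma = \pm 1$ is the sign of the transform. Written more compactly:

    \mathcal{F}[a]^{(j\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{j/4} \sum_{k=0}^{3} e^{\sigma 2\pi i\,kj/4} \cdot a^{(k/4)}]    j \in \{0,1,2,3\}    (21.4-8)

The general radix-r DIF FFT step is

    \mathcal{F}[a]^{(j\%r)} \stackrel{n/r}{=} \mathcal{F}[\mathcal{S}^{j/r} \sum_{k=0}^{r-1} e^{\sigma 2\pi i\,kj/r} \cdot a^{(k/r)}]    j \in \{0,1,2,\ldots,r-1\}    (21.4-9)

21.4.3 Implementation of radix-r FFTs
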


For the implementation of a radix-r FFT with $r \neq 2$ the revbin_permute routine has to be replaced by its radix-r version radix_permute. The reordering now swaps elements $a_x$ with $a_{\tilde{x}}$ where $\tilde{x}$ is obtained from x by reversing its radix-r expansion (see section 2.7 on page 121). In most practical cases one considers $r = p^x$ where p is a prime. Pseudocode for a radix $r = p^x$ DIT FFT:

    procedure fftdit_r(a[], n, is)
    // complex a[0..n-1] input, result.
    // r == power of p (hard-coded)
    // n == power of p (not necessarily a power of r)
    {
        radix_permute(a[], n, p)
        lx := log(r) / log(p)  // r == p ** lx
        ln := log(n) / log(p)
        ldm := (log(n)/log(p)) % lx
        // lx, ln, and ldm are all integers

        if ( ldm != 0 )  // n is not a power of r
        {
            xx := p**ldm
            for z:=0 to n-xx step xx
            {
                fft_dit_xx(a[z..z+xx-1], is)  // inlined length-xx DIT FFT
            }
        }

        for ldm:=ldm+lx to ln step lx
        {
            m := p**ldm
            mr := m/r
            for j := 0 to mr-1
            {
                e := exp(is*2*PI*I*j/m)
                for k:=0 to n-m step m
                {
                    // All code in this block should be inlined and unrolled:
                    // temporary u[0..r-1]
                    for z:=0 to r-1
                    {
                        u[z] := a[k+j+mr*z]
                    }
                    radix_permute(u[], r, p)
                    for z:=1 to r-1  // e**0 == 1
                    {
                        u[z] := u[z] * e**z
                    }
                    r_point_fft(u[], is)
                    for z:=0 to r-1
                    {
                        a[k+j+mr*z] := u[z]
                    }
                }
            }
        }
    }

Of course the loops that use the variable z have to be unrolled, the (length-$p^x$) array u[] has to be replaced by explicit variables (for example, u0, u1, ...), and the r_point_fft(u[],is) should be an inlined $p^x$-point FFT.

There is one pitfall: if one uses the radix-p permutation instead of a radix-$p^x$ permutation (for example, the radix-2 revbin_permute() for a radix-4 FFT), then some additional reordering is necessary in the innermost loop. In the given pseudocode this is indicated by the call radix_permute(u[],r,p) that precedes the r_point_fft(u[],is) line.


21.4.4 Radix-4 DIT FFT 


A C++ routine for the radix-4 DIT FFT is given in [FXT: fft/fftdit4l.cc]:

    static const ulong RX = 4;  // == r
    static const ulong LX = 2;  // == log(r)/log(p) == log_2(r)

    void
    fft_dit4l(Complex *f, ulong ldn, int is)
    // Decimation in time radix-4 FFT.
    {
        double s2pi = ( is>0 ? 2.0*M_PI : -2.0*M_PI );

        const ulong n = (1UL<<ldn);

        revbin_permute(f, n);

        ulong ldm = (ldn&1);

        if ( ldm!=0 )  // n is not a power of 4, need a radix-2 step
        {
            for (ulong r=0; r<n; r+=2)
            {
                Complex a0 = f[r];
                Complex a1 = f[r+1];

                f[r] = a0 + a1;
                f[r+1] = a0 - a1;
            }
        }

        ldm += LX;

        for ( ; ldm<=ldn ; ldm+=LX)
        {
            ulong m = (1UL<<ldm);
            ulong m4 = (m>>LX);
            double ph0 = s2pi/m;

            for (ulong j=0; j<m4; j++)
            {
                double phi = j*ph0;
                Complex e = SinCos(phi);
                Complex e2 = SinCos(2.0*phi);
                Complex e3 = SinCos(3.0*phi);

                for (ulong r=0; r<n; r+=m)
                {
                    ulong i0 = j + r;
                    ulong i1 = i0 + m4;
                    ulong i2 = i1 + m4;
                    ulong i3 = i2 + m4;

                    Complex a0 = f[i0];
                    Complex a1 = f[i2];  // (!)
                    Complex a2 = f[i1];  // (!)
                    Complex a3 = f[i3];

                    a1 *= e;
                    a2 *= e2;
                    a3 *= e3;

                    Complex t0 = (a0+a2) + (a1+a3);
                    Complex t2 = (a0+a2) - (a1+a3);

                    Complex t1 = (a0-a2) + Complex(0,is) * (a1-a3);
                    Complex t3 = (a0-a2) - Complex(0,is) * (a1-a3);

                    f[i0] = t0;
                    f[i1] = t1;
                    f[i2] = t2;
                    f[i3] = t3;
                }
            }
        }
    }


An additional radix-2 step has been prepended which is used when n is an odd power of 2. To improve performance, the call to the procedure radix_permute(u[],r,p) of the pseudocode has been replaced by changing indices in the loops where the a[z] are read. The respective lines are marked with the comment '// (!)'.

A reasonably optimized radix-4 DIT FFT implementation is given in [FXT: . The transform starts with a radix-2 or radix-8 step for the initial pass. The core routine is hard-coded for $\sigma = +1$ and called with swapped real and imaginary part for the inverse transform as explained in section 21.7. The routine uses separate arrays for the real and imaginary parts, which is very problematic with large transforms: the memory access pattern in large skips will degrade performance.

Radix-4 FFT routines that use the C++ type complex are given in [FXT: fft/cfftdit4.cc]. These should be preferred for large transforms. The core routine is hard-coded for $\sigma = -1$, therefore the name suffix _m1:

    void
    fft_dit4_core_m1(Complex *f, ulong ldn)
    // Auxiliary routine for fft_dit4().
    // Radix-4 decimation in time (DIT) FFT.
    // ldn := base-2 logarithm of the array length.
    // Fixed isign = -1.
    // Input data must be in revbin_permuted order.
    {
        const ulong n = (1UL<<ldn);

        if ( n<=2 )
        {
            if ( n==2 )  sumdiff(f[0], f[1]);
            return;
        }

        ulong ldm = ldn & 1;
        if ( ldm!=0 )  // n is not a power of 4, need a radix-8 step
        {
            for (ulong i0=0; i0<n; i0+=8)  fft8_dit_core_m1(f+i0);  // isign
        }
        else
        {
            for (ulong i0=0; i0<n; i0+=4)
            {
                ulong i1 = i0 + 1;
                ulong i2 = i1 + 1;
                ulong i3 = i2 + 1;

                Complex x, y, u, v;
                sumdiff(f[i0], f[i1], x, u);
                sumdiff(f[i2], f[i3], y, v);
                v *= Complex(0, -1);  // isign
                sumdiff(u, v, f[i1], f[i3]);
                sumdiff(x, y, f[i0], f[i2]);
            }
        }
        ldm += 2 * LX;

        for ( ; ldm<=ldn; ldm+=LX)
        {
            ulong m = (1UL<<ldm);
            ulong m4 = (m>>LX);
            const double ph0 = -2.0*M_PI/m;  // isign

            for (ulong j=0; j<m4; j++)
            {
                double phi = j * ph0;
                Complex e = SinCos(phi);
                Complex e2 = e * e;
                Complex e3 = e2 * e;

                for (ulong r=0; r<n; r+=m)
                {
                    ulong i0 = j + r;
                    ulong i1 = i0 + m4;
                    ulong i2 = i1 + m4;
                    ulong i3 = i2 + m4;

                    Complex x = f[i1] * e2;
                    Complex u;
                    sumdiff3_r(x, f[i0], u);

                    Complex v = f[i3] * e3;
                    Complex y = f[i2] * e;
                    sumdiff(y, v);
                    v *= Complex(0, -1);  // isign

                    sumdiff(u, v, f[i1], f[i3]);
                    sumdiff(x, y, f[i0], f[i2]);
                }
            }
        }
    }


The sumdiff() function is defined in [FXT: aux0/sumdiff.h]:

    template <typename Type>
    static inline void sumdiff(Type &a, Type &b)
    // {a, b} <--| {a+b, a-b}
    { Type t=a-b; a+=b; b=t; }
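
To see how the sum-and-difference operations combine into a radix-4 butterfly, the following sketch computes a length-4 Fourier transform of data in natural order, following relations 21.4-7a to 21.4-7d for n = 4. It is an illustration only: the name dft4_sketch is invented here, and the four-argument form of sumdiff() is assumed to store the sum and the difference of its first two arguments in the last two, as its use in the listings above suggests.

```cpp
#include <complex>

using Complex = std::complex<double>;

// {a, b} <--| {a+b, a-b}
template <typename Type>
static inline void sumdiff(Type &a, Type &b)
{ Type t = a - b;  a += b;  b = t; }

// {s, d} <--| {a+b, a-b}  (assumed behavior of the four-argument version)
template <typename Type>
static inline void sumdiff(const Type &a, const Type &b, Type &s, Type &d)
{ s = a + b;  d = a - b; }

// In-place length-4 transform, is = +1 or -1 is the sign of the transform.
void dft4_sketch(Complex *f, int is)
{
    Complex x, y, u, v;
    sumdiff(f[0], f[2], x, u);   // x = f0+f2, u = f0-f2
    sumdiff(f[1], f[3], y, v);   // y = f1+f3, v = f1-f3
    v *= Complex(0, is);         // the only "multiplication": by +i or -i
    sumdiff(x, y, f[0], f[2]);   // c0 = x+y, c2 = x-y
    sumdiff(u, v, f[1], f[3]);   // c1 = u+v, c3 = u-v
}
```

No trigonometric values are needed at this size, which is the saving that the radix-4 algorithms exploit.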


The routine fft8_dit_core_m1() is an unrolled size-8 DIT FFT (hard-coded for $\sigma = -1$) given in [FXT: fft/fft8ditcore.cc]. We further need a version of the routine for the positive sign. It uses a routine fft8_dit_core_p1() for the computation of length-8 DIT FFTs with $\sigma = +1$. The following changes need to be made in the core routine [FXT: fft/cfftdit4.cc]:

    void
    fft_dit4_core_p1(Complex *f, ulong ldn)
    // Fixed isign = +1
    {
    [--snip--]
            for (ulong i0=0; i0<n; i0+=8)  fft8_dit_core_p1(f+i0);  // isign
    [--snip--]
                v *= Complex(0, +1);  // isign
    [--snip--]
            const double ph0 = +2.0*M_PI/m;  // isign
    [--snip--]
                    v *= Complex(0, +1);  // isign
    [--snip--]
    }


The routine called by the user is 


    void
    fft_dit4(Complex *f, ulong ldn, int is)
    // Fast Fourier Transform
    // ldn := base-2 logarithm of the array length
    // is := sign of the transform (+1 or -1)
    // Radix-4 decimation in time algorithm
    {
        revbin_permute(f, 1UL<<ldn);
        if ( is>0 )  fft_dit4_core_p1(f, ldn);
        else         fft_dit4_core_m1(f, ldn);
    }


21.4.5 Radix-4 DIF FFT 

A routine for the radix-4 DIF FFT is (the C++ equivalent is given in [FXT: fft/fftdif4l.cc]):

    procedure fftdif4(a[], ldn, is)
    // complex a[0..2**ldn-1] input, result
    {
        n := 2**ldn
        for ldm := ldn to 2 step -2
        {
            m := 2**ldm
            mr := m/4
            for j := 0 to mr-1
            {
                e := exp(is*2*PI*I*j/m)
                e2 := e * e
                e3 := e2 * e
                for r := 0 to n-m step m
                {
                    u0 := a[r+j]
                    u1 := a[r+j+mr]
                    u2 := a[r+j+mr*2]
                    u3 := a[r+j+mr*3]

                    x := u0 + u2
                    y := u1 + u3
                    t0 := x + y  // == (u0+u2) + (u1+u3)
                    t2 := x - y  // == (u0+u2) - (u1+u3)

                    x := u0 - u2
                    y := (u1 - u3)*I*is
                    t1 := x + y  // == (u0-u2) + (u1-u3)*I*is
                    t3 := x - y  // == (u0-u2) - (u1-u3)*I*is

                    t1 := t1 * e
                    t2 := t2 * e2
                    t3 := t3 * e3

                    a[r+j] := t0
                    a[r+j+mr] := t2  // (!)
                    a[r+j+mr*2] := t1  // (!)
                    a[r+j+mr*3] := t3
                }
            }
        }

        if is_odd(ldn) then  // n not a power of 4
        {
            for r:=0 to n-2 step 2
            {
                {a[r], a[r+1]} := {a[r]+a[r+1], a[r]-a[r+1]}
            }
        }

        revbin_permute(a[],n)
    }

A reasonably optimized implementation, hard-coded for $\sigma = +1$, is [FXT: fft/cfftdif4.cc]:

    static const ulong RX = 4;
    static const ulong LX = 2;

    void
    fft_dif4_core_p1(Complex *f, ulong ldn)
    // Auxiliary routine for fft_dif4().
    // Radix-4 decimation in frequency FFT.
    // Output data is in revbin_permuted order.
    // ldn := base-2 logarithm of the array length.
    // Fixed isign = +1
    {
        const ulong n = (1UL<<ldn);

        if ( n<=2 )
        {
            if ( n==2 )  sumdiff(f[0], f[1]);
            return;
        }

        for (ulong ldm=ldn; ldm>=(LX<<1); ldm-=LX)
        {
            ulong m = (1UL<<ldm);
            ulong m4 = (m>>LX);

            const double ph0 = 2.0*M_PI/m;  // isign

            for (ulong j=0; j<m4; j++)
            {
                double phi = j * ph0;
                Complex e = SinCos(phi);
                Complex e2 = e * e;
                Complex e3 = e2 * e;

                for (ulong r=0; r<n; r+=m)
                {
                    ulong i0 = j + r;
                    ulong i1 = i0 + m4;
                    ulong i2 = i1 + m4;
                    ulong i3 = i2 + m4;

                    Complex x, y, u, v;
                    sumdiff(f[i0], f[i2], x, u);
                    sumdiff(f[i1], f[i3], y, v);
                    v *= Complex(0, +1);  // isign

                    diffsum3(x, y, f[i0]);
                    f[i1] = y * e2;

                    sumdiff(u, v, x, y);
                    f[i3] = y * e3;
                    f[i2] = x * e;
                }
            }
        }

        if ( ldn & 1 )  // n is not a power of 4, need a radix-8 step
        {
            for (ulong i0=0; i0<n; i0+=8)  fft8_dif_core_p1(f+i0);  // isign
        }
        else
        {
            for (ulong i0=0; i0<n; i0+=4)
            {
                ulong i1 = i0 + 1;
                ulong i2 = i1 + 1;
                ulong i3 = i2 + 1;

                Complex x, y, u, v;
                sumdiff(f[i0], f[i2], x, u);
                sumdiff(f[i1], f[i3], y, v);
                v *= Complex(0, +1);  // isign
                sumdiff(x, y, f[i0], f[i1]);
                sumdiff(u, v, f[i2], f[i3]);
            }
        }
    }



The routine for $\sigma = -1$ needs changes where the comment isign appears [FXT: fft/cfftdif4.cc]:

    void
    fft_dif4_core_m1(Complex *f, ulong ldn)
    // Fixed isign = -1
    {
    [--snip--]
            const double ph0 = -2.0*M_PI/m;  // isign
    [--snip--]
                    v *= Complex(0, -1);  // isign
    [--snip--]
            for (ulong i0=0; i0<n; i0+=8)  fft8_dif_core_m1(f+i0);  // isign
    [--snip--]
                v *= Complex(0, -1);  // isign
    [--snip--]
    }

The routine called by the user is

    void
    fft_dif4(Complex *f, ulong ldn, int is)
    // Fast Fourier Transform
    // ldn := base-2 logarithm of the array length
    // is := sign of the transform (+1 or -1)
    // radix-4 decimation in frequency algorithm
    {
        if ( is>0 )  fft_dif4_core_p1(f, ldn);
        else         fft_dif4_core_m1(f, ldn);
        revbin_permute(f, 1UL<<ldn);
    }


A version that uses separate arrays for the real and imaginary parts is given in [FXT: fft/fftdif4.cc]. Again,
the type complex version should be preferred for large transforms. To convert a complex array to and 
from a pair of real and imaginary arrays, use the zip permutation described in section on page 


21.5 Split-radix algorithm 


The idea underlying the split-radix FFT algorithm is to use both radix-2 and radix-4 decompositions at the same time. We use one relation from the radix-2 (DIF) decomposition (relation 21.2-6a on page 415, the one for the even indices) and for the odd indices we use the radix-4 splitting (relations and 21.4-7d on page 419) in a slightly reordered form. The radix-4 decimation in frequency (DIF) step for the split-radix FFT is

    \mathcal{F}[a]^{(0\%2)} \stackrel{n/2}{=} \mathcal{F}[(a^{(0/2)} + a^{(1/2)})]    (21.5-1a)
    \mathcal{F}[a]^{(1\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{1/4}((a^{(0/4)} - a^{(2/4)}) + i\sigma\,(a^{(1/4)} - a^{(3/4)}))]    (21.5-1b)
    \mathcal{F}[a]^{(3\%4)} \stackrel{n/4}{=} \mathcal{F}[\mathcal{S}^{3/4}((a^{(0/4)} - a^{(2/4)}) - i\sigma\,(a^{(1/4)} - a^{(3/4)}))]    (21.5-1c)

Now we have expressed the length-$N = 2^n$ FFT as one length-N/2 and two length-N/4 FFTs. The operation count of the split-radix FFT is actually lower than that of the radix-4 FFT. With the introduced notation it is easy to write down the DIT version of the algorithm. The radix-4 decimation in time (DIT) step for the split-radix FFT is

    \mathcal{F}[a]^{(0/2)} \stackrel{n/2}{=} (\mathcal{F}[a^{(0\%2)}] + \mathcal{S}^{1/2}\mathcal{F}[a^{(1\%2)}])    (21.5-2a)
    \mathcal{F}[a]^{(1/4)} \stackrel{n/4}{=} (\mathcal{F}[a^{(0\%4)}] - \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}]) + i\sigma\,\mathcal{S}^{1/4}(\mathcal{F}[a^{(1\%4)}] - \mathcal{S}^{2/4}\mathcal{F}[a^{(3\%4)}])    (21.5-2b)
    \mathcal{F}[a]^{(3/4)} \stackrel{n/4}{=} (\mathcal{F}[a^{(0\%4)}] - \mathcal{S}^{2/4}\mathcal{F}[a^{(2\%4)}]) - i\sigma\,\mathcal{S}^{1/4}(\mathcal{F}[a^{(1\%4)}] - \mathcal{S}^{2/4}\mathcal{F}[a^{(3\%4)}])    (21.5-2c)



The split-radix DIF algorithm can be implemented as

    procedure fft_splitradix_dif(x[], y[], ldn, is)
    {
        n := 2**ldn
        if n<=1 return
        n2 := 2*n
        for k:=1 to ldn
        {
            n2 := n2 / 2
            n4 := n2 / 4

            e := 2 * PI / n2
            for j:=0 to n4-1
            {
                a := j * e
                cc1 := cos(a)
                ss1 := sin(a)
                cc3 := cos(3*a)  // == 4*cc1*(cc1*cc1-0.75)
                ss3 := sin(3*a)  // == 4*ss1*(0.75-ss1*ss1)

                ix := j
                id := 2*n2
                while ix<n-1
                {
                    i0 := ix
                    while i0 < n
                    {
                        i1 := i0 + n4
                        i2 := i1 + n4
                        i3 := i2 + n4

                        { x[i0], r1 } := { x[i0] + x[i2], x[i0] - x[i2] }
                        { x[i1], r2 } := { x[i1] + x[i3], x[i1] - x[i3] }

                        { y[i0], s1 } := { y[i0] + y[i2], y[i0] - y[i2] }
                        { y[i1], s2 } := { y[i1] + y[i3], y[i1] - y[i3] }

                        { r1, s3 } := { r1+s2, r1-s2 }
                        { r2, s2 } := { r2+s1, r2-s1 }

                        // complex mult: (x[i2],y[i2]) := -(s2,r1) * (ss1,cc1)
                        x[i2] := r1*cc1 - s2*ss1
                        y[i2] := -s2*cc1 - r1*ss1

                        // complex mult: (y[i3],x[i3]) := (r2,s3) * (cc3,ss3)
                        x[i3] := s3*cc3 + r2*ss3
                        y[i3] := r2*cc3 - s3*ss3

                        i0 := i0 + id
                    }
                    ix := 2 * id - n2 + j
                    id := 4 * id
                }
            }
        }

        ix := 1
        id := 4
        while ix<n
        {
            for i0:=ix-1 to n-id step id
            {
                i1 := i0 + 1
                { x[i0], x[i1] } := { x[i0] + x[i1], x[i0] - x[i1] }
                { y[i0], y[i1] } := { y[i0] + y[i1], y[i0] - y[i1] }
            }
            ix := 2 * id - 1
            id := 4 * id
        }

        revbin_permute(x[],n)
        revbin_permute(y[],n)

        if is>0
        {
            for j:=1 to n/2-1
            {
                swap(x[j], x[n-j])
            }
            for j:=1 to n/2-1
            {
                swap(y[j], y[n-j])
            }
        }
    }



The C++ implementation given in [FXT: fft/fftsplitradix.cc] uses a DIF core as above which is given in [129]. The C++ type complex version of the split-radix FFT given in [FXT: fft/cfftsplitradix.cc] uses a DIF or DIT core, depending on the sign of the transform. Here we just give the DIF version:

    void
    split_radix_dif_fft_core(Complex *f, ulong ldn)
    // Split-radix decimation in frequency (DIF) FFT.
    // ldn := base-2 logarithm of the array length.
    // Fixed isign = +1
    // Output data is in revbin_permuted order.
    {
        if ( ldn==0 )  return;

        const ulong n = (1UL<<ldn);

        double s2pi = 2.0*M_PI;  // pi*2*isign
        ulong n2 = 2*n;
        for (ulong k=1; k<ldn; k++)
        {
            n2 >>= 1;  // == n>>(k-1) == n, n/2, n/4, ...
            const ulong n4 = n2 >> 2;  // == n/4, n/8, ...
            const double e = s2pi / n2;

            {  // j==0:
                const ulong j = 0;
                ulong ix = j;
                ulong id = (n2<<1);
                while ( ix<n )
                {
                    for (ulong i0=ix; i0<n; i0+=id)
                    {
                        ulong i1 = i0 + n4;
                        ulong i2 = i1 + n4;
                        ulong i3 = i2 + n4;

                        Complex t0, t1;
                        sumdiff3(f[i0], f[i2], t0);
                        sumdiff3(f[i1], f[i3], t1);

                        // t1 *= Complex(0, 1);  // +isign
                        t1 = Complex(-t1.imag(), t1.real());

                        sumdiff(t0, t1);
                        f[i2] = t0;  // * Complex(cc1, ss1);
                        f[i3] = t1;  // * Complex(cc3, ss3);
                    }
                    ix = (id<<1) - n2 + j;
                    id <<= 2;
                }
            }

            for (ulong j=1; j<n4; j++)
            {
                double a = j * e;
                double cc1,ss1, cc3,ss3;
                SinCos(a, &ss1, &cc1);
                SinCos(3.0*a, &ss3, &cc3);

                ulong ix = j;
                ulong id = (n2<<1);
                while ( ix<n )
                {
                    for (ulong i0=ix; i0<n; i0+=id)
                    {
                        ulong i1 = i0 + n4;
                        ulong i2 = i1 + n4;
                        ulong i3 = i2 + n4;

                        Complex t0, t1;
                        sumdiff3(f[i0], f[i2], t0);
                        sumdiff3(f[i1], f[i3], t1);

                        t1 = Complex(-t1.imag(), t1.real());

                        sumdiff(t0, t1);
                        f[i2] = t0 * Complex(cc1, ss1);
                        f[i3] = t1 * Complex(cc3, ss3);
                    }
                    ix = (id<<1) - n2 + j;
                    id <<= 2;
                }
            }
        }

        for (ulong ix=0, id=4; ix<n; id*=4)
        {
            for (ulong i0=ix; i0<n; i0+=id)  sumdiff(f[i0], f[i0+1]);
            ix = 2*(id-1);
        }
    }


The function sumdiff3() is defined in [FXT: aux0/sumdiff.h]:

    template <typename Type>
    static inline void sumdiff3(Type &a, Type b, Type &d)
    // {a, b, d} <--| {a+b, b, a-b}  (used in split-radix FFTs)
    { d=a-b; a+=b; }


21.6 Symmetries of the Fourier transform 


A bit of notation again. Let $\bar{a}$ be the length-n sequence a reversed around the element with index 0:

    \bar{a}_0 := a_0    (21.6-1a)
    \bar{a}_{n/2} := a_{n/2}    if n even    (21.6-1b)
    \bar{a}_k := a_{n-k} = a_{-k}    (21.6-1c)

That is, we consider the indices modulo n and $\bar{a}$ is the sequence a with negated indices. Element zero stays in its place and for even n there is also an element with index n/2 that stays in place.
Example one, length-4: a := [0, 1, 2, 3], then $\bar{a}$ = [0, 3, 2, 1] (0 and 2 stay).
Example two, length-5: a := [0, 1, 2, 3, 4], then $\bar{a}$ = [0, 4, 3, 2, 1] (only 0 stays).

Let $a_S$ and $a_A$ denote the symmetric and antisymmetric parts of the sequence a, respectively:

    a_S := \frac{1}{2} (a + \bar{a})    (21.6-2a)
    a_A := \frac{1}{2} (a - \bar{a})    (21.6-2b)

The elements with index 0 (and n/2 for even n) of $a_A$ are zero. We have

    a = a_S + a_A    (21.6-3a)
    \bar{a} = a_S - a_A    (21.6-3b)
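
The reversal and the two parts can be computed as in the following sketch for real sequences. The function names (reverse_around_0, symmetric_part, antisymmetric_part) are chosen for this illustration and are not FXT routines.

```cpp
#include <cstddef>
#include <vector>

// Reversal around index 0: r[k] = a[(n-k) mod n], relations 21.6-1a..c.
std::vector<double> reverse_around_0(const std::vector<double> &a)
{
    const std::size_t n = a.size();
    std::vector<double> r(n);
    for (std::size_t k = 0; k < n; ++k)  r[k] = a[(n - k) % n];
    return r;
}

// a_S = (a + reversed a) / 2, relation 21.6-2a.
std::vector<double> symmetric_part(const std::vector<double> &a)
{
    const std::vector<double> r = reverse_around_0(a);
    std::vector<double> s(a.size());
    for (std::size_t k = 0; k < a.size(); ++k)  s[k] = 0.5 * (a[k] + r[k]);
    return s;
}

// a_A = (a - reversed a) / 2, relation 21.6-2b.
std::vector<double> antisymmetric_part(const std::vector<double> &a)
{
    const std::vector<double> r = reverse_around_0(a);
    std::vector<double> s(a.size());
    for (std::size_t k = 0; k < a.size(); ++k)  s[k] = 0.5 * (a[k] - r[k]);
    return s;
}
```

For a = [0, 1, 2, 3] this gives the symmetric part [0, 2, 2, 2] and the antisymmetric part [0, -1, 0, 1]; their sum is a and their difference is the reversed sequence [0, 3, 2, 1].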


Let c + i d be the transform of the sequence a + i b, then

    \mathcal{F}[(a_S + a_A) + i (b_S + b_A)] = (c_S + c_A) + i (d_S + d_A)    where    (21.6-4a)
    \mathcal{F}[a_S] = c_S \in \mathbb{R}    (21.6-4b)
    \mathcal{F}[a_A] = i d_A \in i\mathbb{R}    (21.6-4c)
    \mathcal{F}[i b_S] = i d_S \in i\mathbb{R}    (21.6-4d)
    \mathcal{F}[i b_A] = c_A \in \mathbb{R}    (21.6-4e)

Here we write $a \in \mathbb{R}$ as a short form for a purely real sequence a. Equivalently, we write $a \in i\mathbb{R}$ for a purely imaginary sequence. Thus the transform of a complex symmetric or antisymmetric sequence is symmetric or antisymmetric, respectively:

    \mathcal{F}[a_S + i b_S] = c_S + i d_S    (21.6-5a)
    \mathcal{F}[a_A + i b_A] = c_A + i d_A    (21.6-5b)

The real and imaginary parts of the transform of a symmetric sequence correspond to the real and imaginary parts of the original sequence. With an antisymmetric sequence the transform of the real and imaginary parts correspond to the imaginary and real parts of the original sequence.

    \mathcal{F}[(a_S + a_A)] = c_S + i d_A    (21.6-6a)
    \mathcal{F}[i (b_S + b_A)] = c_A + i d_S    (21.6-6b)

If the sequence a is purely real, then we have

    \mathcal{F}[a_S] = +\overline{\mathcal{F}[a_S]} \in \mathbb{R}    (21.6-7a)
    \mathcal{F}[a_A] = -\overline{\mathcal{F}[a_A]} \in i\mathbb{R}    (21.6-7b)

That is, the transform of a real symmetric sequence is real and symmetric and the transform of a real antisymmetric sequence is purely imaginary and antisymmetric. Thus the transform of a general real sequence is the complex conjugate of its reversal:

    \mathcal{F}[a] = \overline{\mathcal{F}[a]}^{*}    for a \in \mathbb{R}    (21.6-8)

Similarly, for a purely imaginary sequence $b \in i\mathbb{R}$, we have

    \mathcal{F}[b_S] = +\overline{\mathcal{F}[b_S]} \in i\mathbb{R}    (21.6-9a)
    \mathcal{F}[b_A] = -\overline{\mathcal{F}[b_A]} \in \mathbb{R}    (21.6-9b)



430 Chapter 21: The Fourier transform 


We compare the results of the Fourier transform and its inverse (the transform with negated sign $\sigma$) by symbolically writing the transforms as a complex multiplication with the trigonometric term (using C for cosine, S for sine):

    \mathcal{F}[a + i b] :    (a + i b)(C + i S) = (a C - b S) + i (b C + a S)    (21.6-10a)
    \mathcal{F}^{-1}[a + i b] :    (a + i b)(C - i S) = (a C + b S) + i (b C - a S)    (21.6-10b)

The terms on the right side can be identified with those in relation 21.6-4a. Changing the sign of the transform leads to a result where the components due to the antisymmetric parts of the input are negated.

Now write F for the Fourier transform and R for the reversal. We have $F^4 = id$, $F^3 = F^{-1}$, and $F^2 = R$. So the inverse transform can be computed as either

    F^{-1} = R F = F R    (21.6-11)
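
The relations between F and R can be checked numerically. The sketch below uses a plain O(n^2) transform in place of an FFT so that it stays short; the names naive_dft, reverse_0_sketch and dft_minus_via_reversal are made up for this example. Note that the transform here is unnormalized, so applying it twice gives n times the reversed sequence rather than the reversed sequence itself.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;
using Seq = std::vector<Complex>;

// Unnormalized transform: c[k] = sum over x of a[x]*exp(sign*2*pi*i*k*x/n).
Seq naive_dft(const Seq &a, int sign)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    Seq c(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t x = 0; x < n; ++x)
            c[k] += a[x] * std::polar(1.0, sign * 2.0 * pi * double((k * x) % n) / double(n));
    return c;
}

// Reversal around index 0.
Seq reverse_0_sketch(const Seq &a)
{
    const std::size_t n = a.size();
    Seq r(n);
    for (std::size_t k = 0; k < n; ++k)  r[k] = a[(n - k) % n];
    return r;
}

// Transform with sign -1, using only the routine for sign +1 and a reversal.
Seq dft_minus_via_reversal(const Seq &a)
{
    return reverse_0_sketch(naive_dft(a, +1));
}
```

Reversing the input instead of the output, that is naive_dft(reverse_0_sketch(a), +1), gives the same result.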


21.7 Inverse FFT for free 


Some FFT implementations are hard-coded for a fixed sign of the transform. If we cannot easily modify 
the implementation into the transform with the other sign (the inverse transform), then how can we 
compute the inverse FFT? 


If the implementation uses separate arrays for the real and imaginary parts of the complex sequences to be transformed, as in

    procedure my_fft(ar[], ai[], ldn)  // only for is==+1 !
    // real ar[0..2**ldn-1] input, result, real part
    // real ai[0..2**ldn-1] input, result, imaginary part
    {
        // Incredibly complicated code
        // that you cannot see how to modify
        // for is==-1
    }

Then do as follows: with the forward transform being

    my_fft(ar[], ai[], ldn)  // forward FFT

compute the inverse transform as

    my_fft(ai[], ar[], ldn)  // inverse FFT

Note the swapped real and imaginary parts! The same trick works for a procedure coded for fixed is = -1.
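
The trick can be tried out with any fixed-sign routine that works on separate arrays. In the sketch below an O(n^2) transform stands in for the "incredibly complicated" FFT; the names dft_plus_only and dft_other_sign are invented for this illustration. The results are unnormalized, so forward followed by inverse gives n times the input.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Transform hard-coded for sign +1, separate real and imaginary arrays.
void dft_plus_only(std::vector<double> &re, std::vector<double> &im)
{
    const std::size_t n = re.size();
    const double pi = std::acos(-1.0);
    std::vector<double> cr(n, 0.0), ci(n, 0.0);
    for (std::size_t k = 0; k < n; ++k)
    {
        for (std::size_t x = 0; x < n; ++x)
        {
            const double t = 2.0 * pi * double((k * x) % n) / double(n);
            const double c = std::cos(t), s = std::sin(t);
            cr[k] += re[x] * c - im[x] * s;  // real part of (re + i*im)*(c + i*s)
            ci[k] += re[x] * s + im[x] * c;  // imaginary part
        }
    }
    re = cr;
    im = ci;
}

// Transform with sign -1: same routine, real and imaginary arrays swapped.
void dft_other_sign(std::vector<double> &re, std::vector<double> &im)
{
    dft_plus_only(im, re);
}
```

No data is moved for the swap, only the roles of the two arrays are exchanged in the call.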


To see why this works, we note that

    \mathcal{F}[a + i b] = \mathcal{F}[a_S] + i\sigma\,\mathcal{F}[a_A] + i\,\mathcal{F}[b_S] + \sigma\,\mathcal{F}[b_A]    (21.7-1a)
                         = \mathcal{F}[a_S] + i\,\mathcal{F}[b_S] + i\sigma\,(\mathcal{F}[a_A] - i\,\mathcal{F}[b_A])    (21.7-1b)

For the computation with swapped real and imaginary parts we have

    \mathcal{F}[b + i a] = \mathcal{F}[b_S] + i\,\mathcal{F}[a_S] + i\sigma\,(\mathcal{F}[b_A] - i\,\mathcal{F}[a_A])    (21.7-2a)

Now the real and imaginary parts are implicitly swapped at the end of the computation, giving

    \mathcal{F}[a_S] + i\,\mathcal{F}[b_S] - i\sigma\,(\mathcal{F}[a_A] - i\,\mathcal{F}[b_A]) = \mathcal{F}^{-1}[a + i b]    (21.7-2b)

When a complex type is used, then the best way to compute the inverse transform may be to reverse the sequence according to the symmetry of the Fourier transform given as relation the transform with negated sign can be computed by reversing the order of the result (use the routine reverse_0() in [FXT: ). The reversal can also happen with the input data before the transform, which is advantageous if the data has to be copied anyway (use copy_reverse_0() in [FXT: aux1/copy.h]). The additional work will usually not matter.
21.8 Real-valued Fourier transforms


The Fourier transform of a purely real sequence $c = \mathcal{F}[a]$ where $a \in \mathbb{R}$ has a symmetric real part ($\mathrm{Re}\,\bar{c} = \mathrm{Re}\,c$, relation 21.6-8) and an antisymmetric imaginary part ($\mathrm{Im}\,\bar{c} = -\mathrm{Im}\,c$). The symmetric and antisymmetric parts of the original sequence correspond to the symmetric (and purely real) and antisymmetric (and purely imaginary) parts of the transform, respectively:

    \mathcal{F}[a] = \mathcal{F}[a_S] + i\sigma\,\mathcal{F}[a_A]    (21.8-1)


Simply using a complex FFT for real input is a waste by a factor 2 of memory and CPU cycles. There are several alternatives:

• wrapper routines for complex FFTs (section 21.8.3 on the next page),
• usage of the fast Hartley transform (section 25.5 on page 523),
• special versions of the split-radix algorithm (section 21.8.4 on page 434).

All techniques have in common that they store only half of the complex result to avoid the redundancy due to the symmetries of a complex Fourier transform of purely real input. The result of a real to complex FFT (R2CFT) contains the purely real components $c_0$ (the 'DC-part' of the input signal) and, in case n is even, $c_{n/2}$ (the Nyquist frequency part). The inverse procedure, the complex to real transform (C2RFT), must be compatible with the ordering of the R2CFT.


21.8.1 Sign of the transforms 


The sign of the transform can be chosen arbitrarily to be either +1 or −1. Note that the transform with the 'other sign' is not the inverse transform. The R2CFT and its inverse C2RFT must use the same sign.

Some R2CFT and C2RFT implementations are hard-coded for a fixed sign. For the R2CFT with the other sign, negate the imaginary part after the transform. If we have to copy the data before the transform, then we can exploit the relation

    \mathcal{F}[\bar{a}] = \mathcal{F}[a_S] - i\sigma\,\mathcal{F}[a_A]    (21.8-2)

That is, copy the real data in reversed order to get the transform with the other sign. This technique does not involve an extra pass and should be virtually for free.

For the complex to real FFTs (C2RFT) we have to negate the imaginary part before the transform to obtain the transform with the other sign.


21.8.2 Data ordering 


Let c be the Fourier transform of the purely real sequence, stored in the array a[]. All given procedures use one of the following schemes for storing the transformed sequence.

A scheme that interleaves real and imaginary parts ('complex ordering') is

    a[0]   = Re c_0          (21.8-3)
    a[1]   = Re c_{n/2}
    a[2]   = Re c_1
    a[3]   = Im c_1
    a[4]   = Re c_2
    a[5]   = Im c_2
      ...
    a[n-2] = Re c_{n/2-1}
    a[n-1] = Im c_{n/2-1}

Note the absence of the elements Im c_0 and Im c_{n/2} which are always zero.

Some routines store the real parts in the lower half and imaginary parts in the upper half. The data in the lower half will always be ordered as follows:

    a[0]   = Re c_0          (21.8-4)
    a[1]   = Re c_1
    a[2]   = Re c_2
      ...
    a[n/2] = Re c_{n/2}

For the imaginary part of the result there are two schemes:

The 'parallel ordering' is

    a[n/2+1] = Im c_1        (21.8-5)
    a[n/2+2] = Im c_2
    a[n/2+3] = Im c_3
      ...
    a[n-1]   = Im c_{n/2-1}

The 'antiparallel ordering' is

    a[n/2+1] = Im c_{n/2-1}  (21.8-6)
    a[n/2+2] = Im c_{n/2-2}
    a[n/2+3] = Im c_{n/2-3}
      ...
    a[n-1]   = Im c_1

21.8.3 Real-valued Fourier transforms via wrapper routines 


A complex length-n FFT can be used to compute a real length-2n FFT. For a real sequence a one feeds the (length-n) complex sequence $f = a^{(even)} + i\,a^{(odd)}$ into a complex FFT. Some post-processing is necessary. This is not the most elegant real FFT available, but it is directly usable to turn complex FFTs into real FFTs.
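
The post-processing can be derived from the symmetry relations of section 21.6. The following sketch shows the idea in its plainest form, without the in-place tricks and data ordering of the FXT routine: with z the transform of f, the transforms of the even and odd subsequences are recovered as (z_k + conj(z_{m-k}))/2 and (z_k - conj(z_{m-k}))/(2i), and then combined by one radix-2 step. The names are invented for this example, and an O(n^2) transform stands in for the half-length complex FFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Unnormalized O(n^2) transform, used here in place of a complex FFT.
std::vector<Complex> naive_dft(const std::vector<Complex> &a, int sign)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    std::vector<Complex> c(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t x = 0; x < n; ++x)
            c[k] += a[x] * std::polar(1.0, sign * 2.0 * pi * double((k * x) % n) / double(n));
    return c;
}

// Returns c_0 .. c_{n/2} of the transform of the real sequence a (n even).
std::vector<Complex> real_dft_wrapped(const std::vector<double> &a, int sign)
{
    const std::size_t n = a.size(), m = n / 2;
    assert( n >= 2 && n % 2 == 0 );
    const double pi = std::acos(-1.0);

    std::vector<Complex> f(m);
    for (std::size_t x = 0; x < m; ++x)  f[x] = Complex(a[2*x], a[2*x+1]);  // even + i*odd
    const std::vector<Complex> z = naive_dft(f, sign);  // half-length complex transform

    std::vector<Complex> c(m + 1);
    for (std::size_t k = 0; k < m; ++k)
    {
        const Complex zr = std::conj(z[(m - k) % m]);
        const Complex ev = 0.5 * (z[k] + zr);                  // transform of a^(even)
        const Complex od = Complex(0.0, -0.5) * (z[k] - zr);   // transform of a^(odd)
        c[k] = ev + std::polar(1.0, sign * pi * double(k) / double(m)) * od;
        if ( k == 0 )  c[m] = ev - od;                         // Nyquist element
    }
    return c;
}
```

The remaining elements c_{n/2+1}, ..., c_{n-1} are not computed because they are the complex conjugates of c_{n/2-1}, ..., c_1.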


A C++ implementation of the real to complex FFT (R2CFT) is given in [FXT: realfft/realfftwrap.cc], the sign of the transform is hard-coded to $\sigma = +1$:


    void
    wrap_real_complex_fft(double *f, ulong ldn)
    // Real to complex FFT (R2CFT)
    {
        if ( ldn==0 )  return;

        fht_fft((Complex *)f, ldn-1, +1);  // cast

        const ulong n = 1UL<<ldn;
        const ulong nh = n/2, n4 = n/4;
        const double phi0 = M_PI / nh;
        for(ulong i=1; i<n4; i++)
        {
            ulong i1 = 2 * i;   // re low [2, 4, ..., n/2-2]
            ulong i2 = i1 + 1;  // im low [3, 5, ..., n/2-1]

            ulong i3 = n - i1;  // re hi [n-2, n-4, ..., n/2+2]
            ulong i4 = i3 + 1;  // im hi [n-1, n-3, ..., n/2+3]

            double f1r, f2i;
            sumdiff05(f[i3], f[i1], f1r, f2i);

            double f2r, f1i;
            sumdiff05(f[i2], f[i4], f2r, f1i);

            double c, s;
            double phi = i*phi0;
            SinCos(phi, &s, &c);

            double tr, ti;
            cmult(c, s, f2r, f2i, tr, ti);

            // f[i1] = f1r + tr;  // re low
            // f[i3] = f1r - tr;  // re hi
            // =^=
            sumdiff(f1r, tr, f[i1], f[i3]);

            // f[i4] = is * (ti + f1i);  // im hi
            // f[i2] = is * (ti - f1i);  // im low
            // =^=
            sumdiff( ti, f1i, f[i4], f[i2]);
        }
        sumdiff(f[0], f[1]);
    }

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The output is ordered according to relations |21.8-3} The same ordering must be used for the input for 


the inverse routine, the complex to real FFT 


o= +1: 


void 
wrap_complex_real_fft(double *f, ulong ldn) 
// Complex to real FFT (C2RFT). 


{ 


if ( ldn==0 ) return; 

const ulong n = 1UL<<ldn; 

const ulong nh = n/2, n4 = n/4; 
const double phiO = -M_PI / nh; 
for(ulong i-1; i<n4; i++) 


ulong il = 2 * i; // re low [2, 4, ..., n/2-2] 
ulong i2 = ii + 1; // im low [3, 5, ..., n/2-1] 
ulong i3 = n - ii; // re hi [n-2, n-4, ..., n/2+2] 
ulong i4 = i3 + 1; // im hi [n-1, n-3, ..., n/2+3] 


double fir, f2i; 
// double fir 
// double f2i 


f[ii] + f[i3]; // re symm 
f[ii] - f[i3]; // re asymm 


// —^- 
sumdiff(f[ii], f[i3], fir, f2i); 


double f2r, fii; 
// double f2r - 
// double fii = 


-f[i2] - f[i4]; // im symm 
f[i2] - fli4]; // im asymm 


// ="= 

sumdiff(-f[i4], f[i2], fii, f2r); 
double c, s; 

double phi = i*phi0; 

SinCos(phi, &s, &c); 


double tr, ti; 
cmult(c, s, f2r, f2i, tr, ti); 


// f[il] = fir + tr; // re low 
// £ [i3] - fir - tr; // re hi 
sundiff(fir, tr, flii], f[i3]); 
// f[i2] = ti - fii; // im low 
= ti + fili; // im hi 


// £[i4] 
a fii, f[i4], f[i2]); 

T ndit£C£[0], £[1]); 

if ( nh»-2 ) { f[nh] *= 2.0; f[nh*1] *= 2.0; > 


"2RFT). Again the sign of the transform is hard-coded to 
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fht_fft((Complex *)f, ldn-1, -1); // cast 
} 


21.8.4 Real-valued split-radix Fourier transforms 


We give pseudocode for the split-radix real to complex FFT and its inverse. The C++ implementations
are given in [FXT: realfft/realfftsplitradix.cc]. The code given here follows [130], see also [318] (erratum
for page 859 of [318]: at the start of the DO 32 loop replace the obvious assignments by CC1=COS(A),
SS1=SIN(A), CC3=COS(A3), SS3=SIN(A3)).


21.8.5 Real to complex split-radix FFT 


We give a routine for the split-radix R2CFT algorithm, the sign of the transform is hard-coded to \sigma = -1:

procedure r2cft_splitradix_dit(x[], ldn)
{
    n := 2**ldn
    revbin_permute(x[], n);
    ix := 1;
    id := 4;
    do
    {
        i0 := ix-1
        while i0<n
        {
            i1 := i0 + 1
            { x[i0], x[i1] } := { x[i0] + x[i1], x[i0] - x[i1] }  // parallel assignment
            i0 := i0 + id
        }
        ix := 2*id-1
        id := 4 * id
    }
    while ix<n

    n2 := 2
    nn := n/4
    while nn!=0
    {
        ix := 0
        n2 := 2*n2
        id := 2*n2
        n4 := n2/4
        n8 := n2/8
        do  // ix loop
        {
            i0 := ix
            while i0<n
            {
                i1 := i0
                i2 := i1 + n4
                i3 := i2 + n4
                i4 := i3 + n4
                { t1, x[i4] } := { x[i4] + x[i3], x[i4] - x[i3] }
                { x[i1], x[i3] } := { x[i1] + t1, x[i1] - t1 }
                if n4!=1
                {
                    i1 := i1 + n8
                    i2 := i2 + n8
                    i3 := i3 + n8
                    i4 := i4 + n8
                    t1 := (x[i3]+x[i4]) * sqrt(1/2)
                    t2 := (x[i3]-x[i4]) * sqrt(1/2)
                    { x[i4], x[i3] } := { x[i2] - t1, -x[i2] - t1 }
                    { x[i1], x[i2] } := { x[i1] + t2, x[i1] - t2 }
                }
                i0 := i0 + id
            }
            ix := 2*id - n2
            id := 2*id
        }
        while ix<n

        e := 2.0*PI/n2
        a := e
        for j:=2 to n8
        {
            cc1 := cos(a)
            ss1 := sin(a)
            cc3 := cos(3*a)  // == 4*cc1*(cc1*cc1-0.75)
            ss3 := sin(3*a)  // == 4*ss1*(0.75-ss1*ss1)
            a := j*e
            ix := 0
            id := 2*n2
            do  // ix-loop
            {
                i0 := ix
                while i0<n
                {
                    i1 := i0 + j - 1
                    i2 := i1 + n4
                    i3 := i2 + n4
                    i4 := i3 + n4
                    i5 := i0 + n4 - j + 1
                    i6 := i5 + n4
                    i7 := i6 + n4
                    i8 := i7 + n4
                    // complex mult: (t2,t1) := (x[i7],x[i3]) * (cc1,ss1)
                    t1 := cc1 * x[i3] + ss1 * x[i7]
                    t2 := cc1 * x[i7] - ss1 * x[i3]
                    // complex mult: (t4,t3) := (x[i8],x[i4]) * (cc3,ss3)
                    t3 := cc3 * x[i4] + ss3 * x[i8]
                    t4 := cc3 * x[i8] - ss3 * x[i4]
                    { t5, t3 } := { t1 + t3, t1 - t3 }
                    { t6, t4 } := { t2 + t4, t2 - t4 }
                    { x[i8], x[i3] } := { t6 + x[i6], t6 - x[i6] }
                    { x[i4], x[i7] } := { x[i2] - t3, -x[i2] - t3 }
                    { x[i1], x[i6] } := { x[i1] + t5, x[i1] - t5 }
                    { x[i2], x[i5] } := { x[i5] + t4, x[i5] - t4 }
                    i0 := i0 + id
                }
                ix := 2*id - n2
                id := 2*id
            }
            while ix<n
        }
        nn := nn/2
    }
}
The ordering of the output is given as relations 21.8-4 on page 432 for the real part, and relations 21.8-6
for the imaginary part.


21.8.6 Complex to real split-radix FFT 


The following routine is the inverse of r2cft_splitradix_dit(). The imaginary part of the input data
must be ordered according to relations 21.8-6 on page 432. We give pseudocode for the split-radix C2RFT
algorithm, the sign of the transform is hard-coded to \sigma = -1:

procedure c2rft_splitradix_dif(x[], ldn)
{
    n := 2**ldn
    n2 := 2*n
    nn := n/4
    while nn!=0
    {
        ix := 0
        id := n2
        n2 := n2/2
        n4 := n2/4
        n8 := n2/8
        do  // ix loop
        {
            i0 := ix
            while i0<n
            {
                i1 := i0
                i2 := i1 + n4
                i3 := i2 + n4
                i4 := i3 + n4
                { x[i1], t1 } := { x[i1] + x[i3], x[i1] - x[i3] }
                x[i2] := 2*x[i2]
                x[i4] := 2*x[i4]
                { x[i3], x[i4] } := { t1 + x[i4], t1 - x[i4] }
                if n4!=1
                {
                    i1 := i1 + n8
                    i2 := i2 + n8
                    i3 := i3 + n8
                    i4 := i4 + n8
                    { x[i1], t1 } := { x[i2] + x[i1], x[i2] - x[i1] }
                    { t2, x[i2] } := { x[i4] + x[i3], x[i4] - x[i3] }
                    x[i3] := -sqrt(2)*(t2+t1)
                    x[i4] :=  sqrt(2)*(t1-t2)
                }
                i0 := i0 + id
            }
            ix := 2*id - n2
            id := 2*id
        }
        while ix<n

        e := 2.0*PI/n2
        a := e
        for j:=2 to n8
        {
            cc1 := cos(a)
            ss1 := sin(a)
            cc3 := cos(3*a)  // == 4*cc1*(cc1*cc1-0.75)
            ss3 := sin(3*a)  // == 4*ss1*(0.75-ss1*ss1)
            a := j*e
            ix := 0
            id := 2*n2
            do  // ix-loop
            {
                i0 := ix
                while i0<n
                {
                    i1 := i0 + j - 1
                    i2 := i1 + n4
                    i3 := i2 + n4
                    i4 := i3 + n4
                    i5 := i0 + n4 - j + 1
                    i6 := i5 + n4
                    i7 := i6 + n4
                    i8 := i7 + n4
                    { x[i1], t1 } := { x[i1] + x[i6], x[i1] - x[i6] }
                    { x[i5], t2 } := { x[i5] + x[i2], x[i5] - x[i2] }
                    { t3, x[i6] } := { x[i8] + x[i3], x[i8] - x[i3] }
                    { t4, x[i2] } := { x[i4] + x[i7], x[i4] - x[i7] }
                    { t1, t5 } := { t1 + t4, t1 - t4 }
                    { t2, t4 } := { t2 + t3, t2 - t3 }
                    // complex mult: (x[i7],x[i3]) := (t5,t4) * (ss1,cc1)
                    x[i3] := cc1 * t5 + ss1 * t4
                    x[i7] := -cc1 * t4 + ss1 * t5
                    // complex mult: (x[i4],x[i8]) := (t1,t2) * (cc3,ss3)
                    x[i4] := cc3 * t1 - ss3 * t2
                    x[i8] := cc3 * t2 + ss3 * t1
                    i0 := i0 + id
                }
                ix := 2*id - n2
                id := 2*id
            }
            while ix<n
        }
        nn := nn/2
    }

    ix := 1;
    id := 4;
    do
    {
        i0 := ix-1
        while i0<n
        {
            i1 := i0 + 1
            { x[i0], x[i1] } := { x[i0] + x[i1], x[i0] - x[i1] }
            i0 := i0 + id
        }
        ix := 2*id-1
        id := 4 * id
    }
    while ix<n
    revbin_permute(x[], n);
}


21.9 Multi-dimensional Fourier transforms 


Let a_{x,y} (x = 0, 1, 2, ..., C-1 and y = 0, 1, 2, ..., R-1) be a 2-dimensional array. That is, an R x C
‘matrix’ of R rows (of length C) and C columns (of length R). Its 2-dimensional Fourier transform is
defined by:

c = F[a]                                                                    (21.9-1a)
c_{k,h} := \frac{1}{\sqrt{n}} \sum_{x=0}^{C-1} \sum_{y=0}^{R-1} a_{x,y} \, z^{(x k/C + y h/R)}   where   z = e^{\sigma 2 \pi i}     (21.9-1b)

where k \in {0, 1, 2, ..., C-1}, h \in {0, 1, 2, ..., R-1}, and n = R \cdot C. The inverse transform is

a = F^{-1}[c]                                                               (21.9-2a)
a_{x,y} = \frac{1}{\sqrt{n}} \sum_{k=0}^{C-1} \sum_{h=0}^{R-1} c_{k,h} \, z^{-(x k/C + y h/R)}     (21.9-2b)

For an m-dimensional array a_{\vec{x}} (where \vec{x} = (x_1, x_2, x_3, ..., x_m) and x_i \in 0, 1, 2, ..., S_i - 1) the
m-dimensional Fourier transform c_{\vec{k}} (where \vec{k} = (k_1, k_2, k_3, ..., k_m) and k_i \in 0, 1, 2, ..., S_i - 1) is defined as

c_{\vec{k}} := \frac{1}{\sqrt{n}} \sum_{x_1=0}^{S_1-1} \sum_{x_2=0}^{S_2-1} \cdots \sum_{x_m=0}^{S_m-1} a_{\vec{x}} \, z^{(x_1 k_1/S_1 + x_2 k_2/S_2 + ... + x_m k_m/S_m)}     (21.9-3a)

The inverse transform is, like in the 1-dimensional case, the complex conjugate transform.


21.9.1 The row-column algorithm 
The equation of the definition of the 2-dimensional Fourier transform (relation 21.9-1a) can be recast as

c_{k,h} = \frac{1}{\sqrt{n}} \sum_{y=0}^{R-1} \left[ z^{y h/R} \sum_{x=0}^{C-1} a_{x,y} \, z^{x k/C} \right]     (21.9-4)


This shows that the 2-dimensional transform can be computed by applying 1-dimensional transforms, 
first on the rows, then on the columns. The same result is obtained when the columns are transformed 
first and then the rows. 


This leads us directly to the row-column algorithm for 2-dimensional FFTs. Pseudocode to compute the
2-dimensional FFT of a[][] using the row-column method:

procedure rowcol_ft(a[][], R, C, is)
{
    complex a[R][C]  // R (length-C) rows, C (length-R) columns
    for r:=0 to R-1  // FFT rows
    {
        fft(a[r][], C, is)
    }
    complex t[R]  // temporary array for columns
    for c:=0 to C-1  // FFT columns
    {
        copy a[0,1,...,R-1][c] to t[]  // get column
        fft(t[], R, is)
        copy t[] to a[0,1,...,R-1][c]  // write back column
    }
}


Here it is assumed that the rows lie in contiguous memory (as in the C language). The equivalent C++ 
code is given in [FXT: fft/twodimfft.cc]. 
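
The row-column method can be sketched in C++ as follows. This is a minimal sketch and not the FXT routine: a naive O(n^2) transform without normalization, here called dft_inplace() (a hypothetical name), stands in for fft(), and the array is held as a vector of rows.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using Matrix = std::vector<std::vector<cplx>>;  // a[r][c]: R rows of length C

// Naive in-place O(n^2) transform (no normalization), standing in for fft().
void dft_inplace(std::vector<cplx> &f, int is)
{
    const std::size_t n = f.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> c(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t x = 0; x < n; ++x)
            c[k] += f[x] * std::polar(1.0, is * 2.0 * pi * double(x * k % n) / double(n));
    f = c;
}

void rowcol_ft(Matrix &a, int is)
{
    assert(!a.empty());
    const std::size_t R = a.size(), C = a[0].size();
    for (std::size_t r = 0; r < R; ++r)  dft_inplace(a[r], is);  // transform rows
    std::vector<cplx> t(R);  // temporary array for one column
    for (std::size_t c = 0; c < C; ++c)
    {
        for (std::size_t r = 0; r < R; ++r)  t[r] = a[r][c];  // get column
        dft_inplace(t, is);
        for (std::size_t r = 0; r < R; ++r)  a[r][c] = t[r];  // write back column
    }
}
```

The column pass touches memory with stride C, which is the access pattern that the transposing variant avoids.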


Transposing the array before the column pass will, due to a better memory access pattern, improve 
performance in most cases: 


procedure rowcol_fft2d(a[][], R, C, is)
{
    complex a[R][C]  // R (length-C) rows, C (length-R) columns
    for r:=0 to R-1  // FFT rows
    {
        fft(a[r][], C, is)
    }
    transpose( a[R][C] )  // in-place
    for c:=0 to C-1  // FFT columns (which are rows now)
    {
        fft(a[c][], R, is)
    }
    transpose( a[C][R] )  // transpose back (note swapped R,C)
}


Transposing back at the end of the routine can be avoided if the inverse transform follows immediately 
as is typical for a convolution. The inverse transform must then be called with R and C swapped. 


21.10 The matrix Fourier algorithm (MFA) 


The matrix Fourier algorithm (MFA) is an algorithm for 1-dimensional FFTs that works for data lengths 
n = RC. It is quite similar to the row-column algorithm (relation 21.9-4) for 2-dimensional FFTs. The
only differences are n multiplications with trigonometric factors and a final matrix transposition. 


Consider the input array as an R x C-matrix (R rows, C columns), with the rows contiguous in memory. 
Let \sigma be the sign of the transform. The matrix Fourier algorithm (MFA) can be stated as follows:


1. Apply a (length R) FFT on each column. 
2. Multiply each matrix element (index r,c) by exp(\sigma 2 \pi i r c/n).
3. Apply a (length C) FFT on each row. 
4. Transpose the matrix. 
Note the elegance! A variant of the MFA is called four step FFT in [28]. 
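
The four steps can be sketched as follows. To see why they work, write the input index as x = rC + c and the output index as k = k_1 + R k_2, then z^{xk} splits into the column transform factor, the trigonometric factor z^{c k_1}, and the row transform factor. This is a minimal sketch under stated assumptions: naive O(n^2) transforms stand in for the FFTs, and the names dft_strided() and mfa_ft() are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive transform (no normalization) of f[0], f[stride], f[2*stride], ...
void dft_strided(cplx *f, std::size_t len, std::size_t stride, int sign)
{
    const double pi = std::acos(-1.0);
    std::vector<cplx> c(len);
    for (std::size_t k = 0; k < len; ++k)
        for (std::size_t x = 0; x < len; ++x)
            c[k] += f[x * stride] * std::polar(1.0, sign * 2.0 * pi * double(x * k % len) / double(len));
    for (std::size_t k = 0; k < len; ++k)  f[k * stride] = c[k];
}

// Length-(R*C) transform of a[], viewed as R x C matrix with contiguous rows.
void mfa_ft(std::vector<cplx> &a, std::size_t R, std::size_t C, int sign)
{
    const std::size_t n = R * C;
    assert(a.size() == n && n > 0);
    const double pi = std::acos(-1.0);
    // 1. length-R transform on each column
    for (std::size_t c = 0; c < C; ++c)  dft_strided(&a[c], R, C, sign);
    // 2. multiply element (r,c) by exp(sign*2*pi*i*r*c/n)
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            a[r * C + c] *= std::polar(1.0, sign * 2.0 * pi * double(r * c) / double(n));
    // 3. length-C transform on each row
    for (std::size_t r = 0; r < R; ++r)  dft_strided(&a[r * C], C, 1, sign);
    // 4. transpose the R x C matrix (into a C x R matrix)
    std::vector<cplx> t(n);
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            t[c * R + r] = a[r * C + c];
    a = t;
}
```

The out-of-place transposition keeps the sketch short, an in-place routine would be used where memory matters.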


A trivial modification is obtained by executing the steps in reversed order. The transposed matrix Fourier 
algorithm (TMFA) for the FFT: 


1. Transpose the matrix.
2. Apply a (length C) FFT on each row of the matrix.
3. Multiply each matrix element (index r,c) by exp(\sigma 2 \pi i r c/n).
4. Apply a (length R) FFT on each column of the matrix.


A variant of the MFA that, apart from the transpositions, accesses the memory only in consecutive 
address ranges can be stated as 


1. Transpose the matrix.
2. Apply a (length C) FFT on each row of the transposed matrix.
3. Multiply each matrix element (index r,c) by exp(\sigma 2 \pi i r c/n).
4. Transpose the matrix back.
5. Apply a (length R) FFT on each row of the matrix.
6. Transpose the matrix (if the order of the transformed data matters).


The ‘transposed’ version of this algorithm is identical. The performance will depend critically on the 
performance of the transposition routine. 


It is usually a good idea to use factors of the data length n that are close to \sqrt{n}. Of course we can
apply the same algorithm for the row (or column) FFTs again: it can be an improvement to split n into
3 factors (as close to n^{1/3} as possible) if a length-n^{1/3} FFT fits completely into cache. Especially for
systems where CPU clock speed is much higher than memory clock speed the performance may increase
dramatically. A speedup by a factor of 3 can sometimes be observed, even when compared to otherwise
very well optimized FFTs. Another algorithm that is efficient with large arrays is the localized transform
described (for the Hartley transform) in section 25.8 on page 529.
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Chapter 22 


Convolution, correlation, and more FFT algorithms


We give algorithms for fast convolution that are based on the fast Fourier transform. An efficient algorithm 
for the convolution of arrays that do not fit into the main memory (mass storage convolution) is given 
for both complex and real data. Further, weighted convolutions and their algorithms are introduced. 


We describe how fast convolution can be used for computing the z-transform of sequences of arbitrary 
length. Another convolution based algorithm for the Fourier transform of arrays of prime length, Rader's 
algorithm, is described at the end of the chapter. 


Convolution algorithms based on the fast Hartley transform are described in the chapter on the Hartley
transform. The XOR (dyadic) convolution, which is computed via the Walsh transform, is treated in
section 23.8. The OR-convolution and the AND-convolution are described in section 23.12.


22.1 Convolution 


The cyclic convolution (or circular convolution) of two length-n sequences a = [a_0, a_1, ..., a_{n-1}] and
b = [b_0, b_1, ..., b_{n-1}] is defined as the length-n sequence h with elements h_\tau as:

h = a \circledast b                                                  (22.1-1a)
h_\tau := \sum_{x+y \equiv \tau \pmod{n}} a_x \, b_y                 (22.1-1b)

The last equation may be rewritten as

h_\tau := \sum_{x=0}^{n-1} a_x \, b_{(\tau-x) \bmod n}               (22.1-2)

That is, indices \tau - x wrap around, it is a cyclic convolution. A table illustrating the cyclic convolution
of two sequences is shown in figure 22.1-A.


22.1.1 Direct computation 
A C++ implementation of the computation by definition is [FXT: convolution/slowcnvl.h]:

template <typename Type>
void slow_convolution(const Type *f, const Type *g, Type *h, ulong n)
// (cyclic) convolution: h[] := f[] (*) g[]
// n := array length
{
    for (ulong tau=0; tau<n; ++tau)
    {
        Type s = 0.0;
        for (ulong k=0; k<n; ++k)
        {
            ulong k2 = tau - k;
            if ( (long)k2<0 )  k2 += n;  // modulo n
            s += (f[k]*g[k2]);
        }
        h[tau] = s;
    }
}

  +--   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   1:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  0
   2:   2  3  4  5  6  7  8  9 10 11 12 13 14 15  0  1
   3:   3  4  5  6  7  8  9 10 11 12 13 14 15  0  1  2
   4:   4  5  6  7  8  9 10 11 12 13 14 15  0  1  2  3
   5:   5  6  7  8  9 10 11 12 13 14 15  0  1  2  3  4
   6:   6  7  8  9 10 11 12 13 14 15  0  1  2  3  4  5
   7:   7  8  9 10 11 12 13 14 15  0  1  2  3  4  5  6
   8:   8  9 10 11 12 13 14 15  0  1  2  3  4  5  6  7
   9:   9 10 11 12 13 14 15  0  1  2  3  4  5  6  7  8
  10:  10 11 12 13 14 15  0  1  2  3  4  5  6  7  8  9
  11:  11 12 13 14 15  0  1  2  3  4  5  6  7  8  9 10
  12:  12 13 14 15  0  1  2  3  4  5  6  7  8  9 10 11
  13:  13 14 15  0  1  2  3  4  5  6  7  8  9 10 11 12
  14:  14 15  0  1  2  3  4  5  6  7  8  9 10 11 12 13
  15:  15  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14

n 0 1 2 3 (b)

QO: 0 1 2 4 d

(a) 1: 1 3 5 <--= h[5] contains a[i1]*b[2]
a 4 8 9 <--= h[9] contains a[2]*b[2]

Figure 22.1-A: Semi-symbolic table of the cyclic convolution of two sequences (top). The entries denote
where in the convolution the products of the input elements can be found (bottom).


The following version avoids the if statement in the inner loop:

    for (ulong tau=0; tau<n; ++tau)
    {
        Type s = 0.0;
        ulong k = 0;
        for (ulong k2=tau; k<=tau; ++k, --k2)  s += (f[k]*g[k2]);
        for (ulong k2=n-1; k<n; ++k, --k2)  s += (f[k]*g[k2]);  // wrapped around
        h[tau] = s;
    }


For length-n sequences this procedure involves O(n^2) operations, therefore it is slow for large n. For
short lengths the algorithm is just fine. Unrolled routines will offer good performance, especially for
convolutions of fixed length. For medium length convolutions the splitting schemes given on page 550
and in section 40.2 on page 827 are applicable.
22.1.2 Computation via FFT 


The fast Fourier transform provides us with an efficient way to compute convolutions that needs only
O(n log(n)) operations. The convolution property of the Fourier transform is

F[a \circledast b] = F[a] \cdot F[b]                                 (22.1-3)
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The multiplication indicated by the dot is element-wise. That is, convolution in original space is element- 
wise multiplication in Fourier space. The statement can be motivated as follows: 


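F[a]_k \, F[b]_k = \sum_x a_x z^{kx} \sum_y b_y z^{ky}                                                     (22.1-4a)
                 = \sum_x a_x z^{kx} \sum_\tau b_{\tau-x} z^{k(\tau-x)}   where   y = \tau - x               (22.1-4b)
                 = \sum_x \sum_\tau a_x z^{kx} \, b_{\tau-x} z^{k(\tau-x)} = \sum_\tau \Bigl( \sum_x a_x b_{\tau-x} \Bigr) z^{k\tau}     (22.1-4c)
                 = \Bigl( F\Bigl[ \sum_x a_x b_{\tau-x} \Bigr] \Bigr)_k = \bigl( F[a \circledast b] \bigr)_k   (22.1-4d)

Rewriting relation 22.1-3 as

a \circledast b = F^{-1}\bigl[ F[a] \cdot F[b] \bigr]                (22.1-5)

tells us how to proceed. We give pseudocode for the cyclic convolution of two complex sequences x[]
and y[], the result is returned in y[]: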


procedure fft_cyclic_convolution(x[], y[], n)
{
    complex x[0..n-1], y[0..n-1]

    // transform data:
    fft(x[], n, +1)
    fft(y[], n, +1)

    // element-wise multiplication in transformed domain:
    for i:=0 to n-1
    {
        y[i] := y[i] * x[i]
    }

    // transform back:
    fft(y[], n, -1)

    // normalize:
    n1 := 1/n
    for i:=0 to n-1
    {
        y[i] := y[i] * n1
    }
}


It is assumed that the procedure fft() does no normalization. For the normalization loop we precompute
1/n and multiply as divisions are usually much slower than multiplications.
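
The pseudocode translates almost line by line into C++. The following is a minimal sketch, not the FXT implementation: the recursive radix-2 routine fft() below is a simple stand-in that assumes the length is a power of 2 and does no normalization.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Recursive radix-2 FFT without normalization; the length must be a power of 2.
void fft(std::vector<cplx> &a, int sign)
{
    const std::size_t n = a.size();
    if (n <= 1)  return;
    std::vector<cplx> ev(n / 2), od(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)  { ev[i] = a[2*i];  od[i] = a[2*i+1]; }
    fft(ev, sign);
    fft(od, sign);
    const double phi = sign * 2.0 * std::acos(-1.0) / double(n);
    for (std::size_t k = 0; k < n / 2; ++k)
    {
        const cplx w = std::polar(1.0, phi * double(k)) * od[k];
        a[k]         = ev[k] + w;
        a[k + n / 2] = ev[k] - w;
    }
}

// Cyclic convolution of x[] and y[] (passed by value, so the inputs are kept).
std::vector<cplx> fft_cyclic_convolution(std::vector<cplx> x, std::vector<cplx> y)
{
    const std::size_t n = x.size();
    assert(n == y.size() && n != 0 && (n & (n - 1)) == 0);
    fft(x, +1);  // transform data
    fft(y, +1);
    for (std::size_t i = 0; i < n; ++i)  y[i] *= x[i];  // element-wise multiplication
    fft(y, -1);  // transform back
    const double n1 = 1.0 / double(n);  // normalize
    for (std::size_t i = 0; i < n; ++i)  y[i] *= n1;
    return y;
}
```

For real input the imaginary parts of the result are zero up to rounding errors.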


22.1.3 Avoiding the revbin permutations 


We can save the revbin permutations by observing that any DIF FFT is of the form 


DIF_FFT_CORE(f, n); 
revbin_permute(f, n); 


and any DIT FFT is of the form 


revbin_permute(f, n); 
DIT_FFT_CORE(f, n); 


This way a convolution routine that uses DIF FFTs for the transform and DIT FFTs as inverse transform
can omit the revbin permutations. This is demonstrated in the C++ implementation for the cyclic
convolution of complex sequences [FXT: convolution/fftcocnvl.cc]:

#define DIT_FFT_CORE fft_dit4_core_m1  // isign = -1
#define DIF_FFT_CORE fft_dif4_core_p1  // isign = +1
void
fft_complex_convolution(Complex * restrict f, Complex * restrict g,
                        ulong ldn, double v/*=0.0*/)
// (complex, cyclic) convolution: g[] := f[] (*) g[]
// (use zero padded data for usual convolution)
// ldn := base-2 logarithm of the array length
// Supply a value for v for a normalization factor != 1/n
{
    const ulong n = (1UL<<ldn);

    DIF_FFT_CORE(f, ldn);
    DIF_FFT_CORE(g, ldn);
    if ( v==0.0 )  v = 1.0/n;
    for (ulong i=0; i<n; ++i)
    {
        Complex t = g[i] * f[i];
        g[i] = t * v;
    }
    DIT_FFT_CORE(g, ldn);
}


The signs of the two FFTs must be different but are otherwise immaterial. 


The auto-convolution (or self-convolution) of a sequence is defined as the convolution of a sequence with
itself: h = a \circledast a. The corresponding procedure needs only two instead of three FFTs.


22.1.4 Linear convolution 


  +--   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   1:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
   2:   2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
   3:   3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
   4:   4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   5:   5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
   6:   6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
   7:   7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
   8:   8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
   9:   9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  10:  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
  11:  11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
  12:  12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
  13:  13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
  14:  14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
  15:  15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30


Figure 22.1-B: Semi-symbolic table for the linear convolution of two length-16 sequences. 


The linear convolution of two length-n sequences a and b is the length-2n sequence h defined as

h = a \circledast_{lin} b                                            (22.1-6a)
h_\tau := \sum_{x=0}^{2n-1} a_x \, b_{\tau-x}   where   \tau = 0, 1, 2, ..., 2n-1     (22.1-6b)

where we set a_k = 0 if k < 0 or k >= n, and the same for out-of-range elements b_k. The linear convolution
is sometimes called acyclic convolution, as there is no wrap around of the indices. We note that h_{2n-1},
the last element of the sequence h, is always zero.

The semi-symbolic table for the acyclic convolution is given in figure 22.1-B. The elements in the lower
right triangle do not “wrap around” anymore, they go into extra buckets. Note there are 31 buckets
labeled 0, 1, ..., 30.


A routine that computes the linear convolution by the definition is [FXT: convolution/slowcnvl-lin.h]:

template <typename Type>
void slow_linear_convolution(const Type *f, const Type *g, Type *h, ulong n)
// Linear (acyclic) convolution.
// n := array length of f[] and g[]
// The array h[] must have 2*n elements.
{
    // compute h0 (left half):
    for (ulong tau=0; tau<n; ++tau)
    {
        Type s0 = 0;
        for (ulong k=0, k2=tau; k<=tau; ++k, --k2)  s0 += (f[k]*g[k2]);
        h[tau] = s0;
    }

    // compute h1 (right half):
    for (ulong tau=0; tau<n; ++tau)
    {
        Type s1 = 0;
        for (ulong k2=n-1, k=tau+1; k<n; ++k, --k2)  s1 += (f[k]*g[k2]);
        h[n+tau] = s1;
    }
}


To compute the linear convolution of two length-n sequences a and b, we can use a length-2n cyclic
convolution of the zero padded sequences A and B where

A := [a_0, a_1, a_2, ..., a_{n-1}, 0, 0, ..., 0]                     (22.1-7a)
B := [b_0, b_1, b_2, ..., b_{n-1}, 0, 0, ..., 0]                     (22.1-7b)

With fast FFT-based algorithms for the cyclic convolution we can compute the linear convolution with
the same complexity.
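
The zero padding can be sketched as follows. This is a minimal sketch under stated assumptions: the cyclic convolution is computed by the definition (any FFT-based routine could take its place), integer data is used so that the results are exact, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cyclic convolution by the definition, standing in for an FFT-based routine.
std::vector<long> cyclic_convolution(const std::vector<long> &a, const std::vector<long> &b)
{
    const std::size_t n = a.size();
    assert(b.size() == n);
    std::vector<long> h(n, 0);
    for (std::size_t x = 0; x < n; ++x)
        for (std::size_t y = 0; y < n; ++y)
            h[(x + y) % n] += a[x] * b[y];
    return h;
}

// Linear convolution of two length-n sequences via a length-2n cyclic convolution.
std::vector<long> linear_convolution(std::vector<long> a, std::vector<long> b)
{
    const std::size_t n = a.size();
    assert(b.size() == n);
    a.resize(2 * n, 0);  // A := [a_0, ..., a_{n-1}, 0, ..., 0]
    b.resize(2 * n, 0);  // B := [b_0, ..., b_{n-1}, 0, ..., 0]
    return cyclic_convolution(a, b);
}
```

For the sequence [1, 1, 1, 1] the linear self-convolution is [1, 2, 3, 4, 3, 2, 1, 0], the last element being zero as noted above.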


Linear convolution is polynomial multiplication: let A = a_0 + a_1 x + a_2 x^2 + ..., B = b_0 + b_1 x + b_2 x^2 + ...,
and C = A B = c_0 + c_1 x + c_2 x^2 + ..., then

c_k = \sum_{i+j=k} a_i \, b_j                                        (22.1-8)

This is just another way to write relation 22.1-6a. Chapter 28 explains how fast convolution
algorithms can be used for fast multiplication of multiprecision numbers.
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Figure 22.2-A: Semi-symbolic table for the cyclic correlation of two length-16 sequences. 


The cyclic correlation (or circular correlation) of two real length-n sequences a = [a_0, a_1, ..., a_{n-1}] and
b = [b_0, b_1, ..., b_{n-1}] can be defined as the length-n sequence h where

h_\tau := \sum_{y-x \equiv \tau \pmod{n}} a_x \, b_y                 (22.2-1)
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Figure 22.2-B: Semi-symbolic table for the linear (acyclic) correlation of two length-16 sequences. 


The relation can also be written as

h_\tau = \sum_{x=0}^{n-1} a_x \, b_{(x+\tau) \bmod n}                (22.2-2)

The semi-symbolic table for the cyclic correlation is shown in figure 22.2-A. For the computation of the
linear (or acyclic) correlation the sequences have to be zero-padded as in the algorithm for the linear
convolution. The corresponding table is shown in figure 22.2-B.


The auto-correlation (or self-correlation) is the correlation of a sequence with itself, the correlation of 
two distinct sequences is also called cross-correlation. The term auto-correlation function (ACF) is often 
used for the auto-correlation sequence. 


22.2.1 Direct computation 
A C++ implementation of the computation by the definition is [FXT: correlation/slowcorr.h]:

template <typename Type>
void slow_correlation(const Type *f, const Type *g, Type * restrict h, ulong n)
// Cyclic correlation of f[], g[], both real-valued sequences.
// n := array length
{
    for (ulong tau=0; tau<n; ++tau)
    {
        Type s = 0.0;
        for (ulong k=0; k<n; ++k)
        {
            ulong k2 = k + tau;
            if ( k2>=n )  k2 -= n;
            s += (g[k]*f[k2]);
        }
        h[tau] = s;
    }
}


The if statement in the inner loop is avoided by the following version:

    for (ulong tau=0; tau<n; ++tau)
    {
        Type s = 0.0;
        ulong k = 0;
        for (ulong k2=tau; k2<n; ++k, ++k2)  s += (g[k]*f[k2]);
        for (ulong k2=0; k<n; ++k, ++k2)  s += (g[k]*f[k2]);
        h[tau] = s;
    }
For the linear correlation we avoid zero products:

template <typename Type>
void slow_correlation0(const Type *f, const Type *g, Type * restrict h, ulong n)
// Linear correlation of f[], g[], both real-valued sequences.
// n := array length
// Version for zero padded data:
//   f[k],g[k] == 0 for k=n/2 ... n-1
// n must be >=2
{
    const ulong nh = n/2;
    for (ulong tau=0; tau<nh; ++tau)  // k2 == tau + k
    {
        Type s = 0;
        for (ulong k=0, k2=tau; k2<nh; ++k, ++k2)  s += (f[k]*g[k2]);
        h[tau] = s;
    }

    for (ulong tau=nh; tau<n; ++tau)  // k2 == tau + k - n
    {
        Type s = 0;
        for (ulong k=n-tau, k2=0; k<nh; ++k, ++k2)  s += (f[k]*g[k2]);
        h[tau] = s;
    }
}

The algorithm involves O(n^2) operations and is therefore slow with very long arrays.


22.2.2 Computation via FFT 


A simple algorithm for fast correlation follows from the relation

h = F^{-1}\bigl[ F[\bar{a}] \cdot F[b] \bigr]                        (22.2-3)

That is, use a convolution algorithm with one of the input sequences reversed (indices negated modulo n).
For purely real sequences the relation is equivalent to complex conjugation of one of the inner transforms:

h = F^{-1}\bigl[ F[a]^{*} \cdot F[b] \bigr]                          (22.2-4)

For the computation of the self-correlation the latter relation is the only reasonable way to go: first
transform the input sequence, then multiply each element by its complex conjugate, and finally transform
back. A C++ implementation is [FXT: correlation/fftcorr.cc]:


void
fft_correlation(double *f, double *g, ulong ldn)
// Cyclic correlation of f[], g[], both real-valued sequences.
// Result is written to g[].
// ldn := base-2 logarithm of the array length
{
    const ulong n=(1UL<<ldn);
    const ulong nh=(n>>1);

    fht_real_complex_fft(f, ldn);  // real, imag part in lower, upper half
    fht_real_complex_fft(g, ldn);

    const double v = 1.0/n;
    g[0] *= f[0] * v;
    g[nh] *= f[nh] * v;
    for (ulong i=1,j=n-1; i<nh; ++i,--j)  // real at index i, imag at index j
    {
        cmult_n(f[i], -f[j], g[i], g[j], v);
    }

    fht_complex_real_fft(g, ldn);
}


The function cmult_n() is defined in [FXT: aux0/cmult.h]:

static inline void
cmult_n(double c, double s, double &u, double &v, double dn)
// {u,v} <--| {dn*(u*c-v*s), dn*(u*s+v*c)}
{ double t = u*s+v*c;  u *= c;  u -= v*s;  u *= dn;  v = t*dn; }

We note that relation 22.2-4 also holds for complex sequences.
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22.2.3 Correlation and difference sets 1 


The linear auto-correlation of a sequence that contains zeros and ones only (a delta sequence) is the set 
of mutual differences of the positions of the ones, including multiplicity. An example: 


 [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]   <--= delta array R
 [4, 2, 1, 2, 1, 0, 0, 1, 2, 1, 2]   <--= linear ACF
  0, 1, 2, 3, 4, 5,-5,-4,-3,-2,-1    <--= index

Element zero of the ACF tells us that there are four elements in R (each element has difference zero to 
just itself). Element one tells us that there are two pairs of consecutive elements, it is identical to the 
last element (element at index —1). There is just one pair of elements in R whose indices differ by 2 
(elements at index 2 and —2 of the ACF), and so on. The ACF does not tell us where the elements with 
a certain difference are. 

The delta array with ones at the seven positions 0, 3, 4, 12, 18, 23, and 25 has the ACF 


 [7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, (+symm.)]
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...                                         26,   <--= index

That is, a ruler of length 26 with marks only at the seven given positions can be used to measure most 
of the distances up to 26 (the smallest missing distance is 10). Further, no distance appears more than 
once. Sequences with this property are called Golomb rulers and they are very hard to find. 
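
Both examples can be checked by direct computation. The following minimal sketch computes the linear ACF of a delta array for the non-negative indices and tests whether a set of marks has no repeated distance. The names delta_acf() and is_golomb_ruler() are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear auto-correlation of a 0/1 array for the indices 0..n-1
// (the values at negative indices follow by symmetry).
std::vector<unsigned> delta_acf(const std::vector<int> &r)
{
    const std::size_t n = r.size();
    std::vector<unsigned> acf(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < n; ++j)
            if (r[i] && r[j])  ++acf[j - i];
    return acf;
}

// True if no nonzero distance between two marks occurs more than once.
bool is_golomb_ruler(const std::vector<std::size_t> &marks)
{
    assert(!marks.empty());
    std::size_t len = 0;
    for (std::size_t m : marks)  if (m > len)  len = m;
    std::vector<int> r(len + 1, 0);  // delta array with ones at the marks
    for (std::size_t m : marks)  r[m] = 1;
    const std::vector<unsigned> acf = delta_acf(r);
    for (std::size_t d = 1; d < acf.size(); ++d)
        if (acf[d] > 1)  return false;
    return true;
}
```

The direct computation needs O(n^2) operations, for long arrays the ACF would be computed via FFT instead.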


If we allow for two rulers, then the set of mutual differences in positions is the cross-correlation. For this 
setting analogues of Golomb rulers (that do not have any missing differences) can be found. We use dots 
for zeros: 

[Figure: the two rulers R1 and R2 (dots for zeros) and their cross-correlation]

The rulers are binary representations of the evaluations F(1/2) and F(1/4) of a curious function given
in section 38.10.1 on page 750.


22.3 Correlation, convolution, and circulant matrices 1 


The cyclic correlation and convolution of two vectors correspond to multiplication with circulant matrices.
In the following examples we fix the dimension to n = 4, the general case will be obvious. Let a =
[a_0, a_1, a_2, a_3], b = [b_0, b_1, b_2, b_3], and r = [r_0, r_1, r_2, r_3] the cyclic correlation of a and b (that is,
r_\tau = \sum_{k-j \equiv \tau \bmod n} a_j b_k):

r_0 = b_0 a_0 + b_1 a_1 + b_2 a_2 + b_3 a_3                          (22.3-1a)
r_1 = b_1 a_0 + b_2 a_1 + b_3 a_2 + b_0 a_3                          (22.3-1b)
r_2 = b_2 a_0 + b_3 a_1 + b_0 a_2 + b_1 a_3                          (22.3-1c)
r_3 = b_3 a_0 + b_0 a_1 + b_1 a_2 + b_2 a_3                          (22.3-1d)

We have r^T = R_a b^T where R_a is a circulant matrix where row 0 is a and row k + 1 is the cyclic right
shift of row k:

R_a = \begin{pmatrix}
  a_0 & a_1 & a_2 & a_3 \\
  a_3 & a_0 & a_1 & a_2 \\
  a_2 & a_3 & a_0 & a_1 \\
  a_1 & a_2 & a_3 & a_0
\end{pmatrix}                                                        (22.3-2)

Now set c = [c_0, c_1, c_2, c_3] to the cyclic convolution of a and b (that is, c_\tau = \sum_{j+k \equiv \tau \bmod n} a_j b_k):

c_0 = b_0 a_0 + b_3 a_1 + b_2 a_2 + b_1 a_3                          (22.3-3a)
c_1 = b_1 a_0 + b_0 a_1 + b_3 a_2 + b_2 a_3                          (22.3-3b)
c_2 = b_2 a_0 + b_1 a_1 + b_0 a_2 + b_3 a_3                          (22.3-3c)
c_3 = b_3 a_0 + b_2 a_1 + b_1 a_2 + b_0 a_3                          (22.3-3d)

We have c^T = C_a b^T where C_a = R_a^T is a circulant matrix where column 0 is a^T and column k + 1 is the
cyclic down shift of column k:

C_a = \begin{pmatrix}
  a_0 & a_3 & a_2 & a_1 \\
  a_1 & a_0 & a_3 & a_2 \\
  a_2 & a_1 & a_0 & a_3 \\
  a_3 & a_2 & a_1 & a_0
\end{pmatrix}                                                        (22.3-4)

Let F be the matrix corresponding to the Fourier transform (either sign, here we choose \sigma = +1, so that
\omega = +i):

F = \begin{pmatrix}
  \omega^0 & \omega^0 & \omega^0 & \omega^0 \\
  \omega^0 & \omega^1 & \omega^2 & \omega^3 \\
  \omega^0 & \omega^2 & \omega^4 & \omega^6 \\
  \omega^0 & \omega^3 & \omega^6 & \omega^9
\end{pmatrix}
= \begin{pmatrix}
  +1 & +1 & +1 & +1 \\
  +1 & +i & -1 & -i \\
  +1 & -1 & +1 & -1 \\
  +1 & -i & -1 & +i
\end{pmatrix}                                                        (22.3-5)

The convolution property of the Fourier transform can now be expressed as

C_a b^T = F^{-1} \, diag(F a^T) \, F b^T                             (22.3-6)

where diag(v) is the matrix having the components of v on its diagonal:

diag([v_0, v_1, v_2, v_3]) = \begin{pmatrix}
  v_0 & 0 & 0 & 0 \\
  0 & v_1 & 0 & 0 \\
  0 & 0 & v_2 & 0 \\
  0 & 0 & 0 & v_3
\end{pmatrix}                                                        (22.3-7)

The corresponding identity for the correlation is

R_a b^T = F \, diag(F a^T) \, F^{-1} b^T                             (22.3-8)

Relation 22.3-6, restated as

F^{-1} C_a F = diag(F a^T)                                           (22.3-9)

shows that F diagonalizes a circulant matrix C_a and its eigenvalues are F a^T, the components of the
Fourier transform of a. The determinant of C_a therefore equals the product of the elements of F a^T:

\det C_a = \prod_{j=0}^{n-1} \bigl( a_0 + a_1 \omega^{j} + a_2 \omega^{2j} + ... + a_{n-1} \omega^{(n-1)j} \bigr)     (22.3-10)

Compare to relation 36.1-23 for the multisection of power series.
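
The correspondence between the circulant matrices and the two operations can be checked with a few lines of code. This is a minimal sketch with integer data so that all results are exact; the type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<long>;
using Mat = std::vector<Vec>;

// C_a: entry (i,j) is a[(i-j) mod n]; column 0 is a, the columns shift down.
Mat circulant_convolution_matrix(const Vec &a)
{
    const std::size_t n = a.size();
    Mat m(n, Vec(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            m[i][j] = a[(i + n - j) % n];
    return m;
}

// R_a: the transpose of C_a; row 0 is a, the rows shift right.
Mat circulant_correlation_matrix(const Vec &a)
{
    const std::size_t n = a.size();
    Mat m(n, Vec(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            m[i][j] = a[(j + n - i) % n];
    return m;
}

// Matrix-vector product m * b.
Vec mat_vec(const Mat &m, const Vec &b)
{
    assert(!m.empty() && m[0].size() == b.size());
    Vec r(m.size(), 0);
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            r[i] += m[i][j] * b[j];
    return r;
}
```

With a = [1, 2, 3, 4] and b = [5, 6, 7, 8] the product with C_a gives the cyclic convolution [66, 68, 66, 60] and the product with R_a gives the cyclic correlation [70, 64, 62, 64].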
22.4 Weighted Fourier transforms and convolutions 


We introduce the weighted Fourier transform and the weighted convolution which serve as an ingredient
for the MFA based convolution algorithm described later in this chapter.


22.4.1 The weighted Fourier transform


We define a new kind of transform by slightly modifying the definition of the Fourier transform
(formula 21.1-1a on page 410):

c = W_v[a]                                                           (22.4-1a)
c_k := \sum_{x=0}^{n-1} v_x \, a_x \, z^{x k}     v_x \neq 0 \;\forall x     (22.4-1b)

where z := e^{\sigma 2 \pi i/n}. The sequence c is called the (discrete) weighted transform of the sequence a with
the weight sequence v. The weighted transform with v_x = \frac{1}{\sqrt{n}} \;\forall x is just the usual Fourier transform.
The inverse transform is

a = W_v^{-1}[c]                                                      (22.4-2a)
a_x = \frac{1}{n \, v_x} \sum_{k=0}^{n-1} c_k \, z^{-x k}            (22.4-2b)

This can be seen as follows:

W_v^{-1}\bigl[ W_v[a] \bigr]_y = \frac{1}{n \, v_y} \sum_{k=0}^{n-1} \sum_{x=0}^{n-1} v_x a_x z^{x k} z^{-y k}
                               = \frac{1}{n \, v_y} \sum_{x=0}^{n-1} v_x a_x \sum_{k=0}^{n-1} z^{(x-y) k}     (22.4-3a)
                               = \frac{1}{n \, v_y} \sum_{x=0}^{n-1} v_x a_x \, \delta_{x,y} \, n = a_y      (22.4-3b)

Obviously all v_x have to be invertible. That W_v[W_v^{-1}[a]] is also identity is apparent from the definitions.


Given an FFT routine it is easy to set up a weighted Fourier transform. Pseudocode for the discrete 
weighted Fourier transform: 


procedure weighted_ft(a[], v[], n, is)
{
    for x:=0 to n-1
    {
        a[x] := a[x] * v[x]
    }
    fft(a[], n, is)
}

The inverse is essentially identical. Pseudocode for the inverse discrete weighted Fourier transform:

procedure inverse_weighted_ft(a[], v[], n, is)
{
    fft(a[], n, -is)
    for x:=0 to n-1
    {
        a[x] := a[x] / v[x]
    }
}

The C++ implementations are given in [FXT: fft/weightedfft.cc].


22.4.2 Weighted convolution 


In the definition of the cyclic convolution h of two sequences a and b (relations 22.1-1a and 22.1-1b on page
440) we can distinguish between those summands where the index x + y wrapped around (x + y = n + \tau)
and those where simply x + y = \tau holds. These are, following the notation in [116], denoted by h^{(1)} and
h^{(0)}, respectively. We have

h = h^{(0)} + h^{(1)}   where                                        (22.4-4a)
h^{(0)}_\tau = \sum_{x \le \tau} a_x \, b_{\tau-x}                   (22.4-4b)
h^{(1)}_\tau = \sum_{x > \tau} a_x \, b_{n+\tau-x}                   (22.4-4c)

The sequences h^{(0)} and h^{(1)} are the left and right half of the linear convolution sequence a \circledast_{lin} b, defined
by relation 22.1-6a on page 443. For example, the linear self-convolution of the sequence [1, 1, 1, 1] is the
length-8 sequence [h^{(0)}][h^{(1)}] = [1, 2, 3, 4][3, 2, 1, 0], its cyclic self-convolution is [h^{(0)} + h^{(1)}] = [4, 4, 4, 4].

The direct (slow) routine for linear convolution can be modified to compute just one of either h^{(0)} or h^{(1)}
[FXT: convolution/slowcnvlhalf.h]:
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template <typename Type> 

void slow_half_convolution(const Type *f, const Type *g, Type *h, ulong n, int h01) 
// Half cyclic convolution. 

// Part determined by h01 which must be 0 or 1. 

// n := array length 


if ( O==h01 ) // compute ho: 
for (ulong tau-0; tau<n; ++tau) 


Type sO = 0.0; 


for (ulong k-0, k2-tau; k<=tau; ++k, --k2) s0 += (f[k]*g[k2]); 
h[tau] = s0; 
} 
F 
else // compute hi (wrapped part): 
{ 
for (ulong tau=0; tau<n; ++tau) 
: Type si = 0.0; 
for (ulong k2-n-1, k-tauti; k<n; ++k, --k2) s1 += (f[k]*g[k2]); 
h[tau] » s1; 
} 
} 


Define the weighted (cyclic) convolution $h_v$ by

    h_v = a \circledast_{\{v\}} b                                    (22.4-5a)
        = W_v^{-1}[ W_v[a] \cdot W_v[b] ]                            (22.4-5b)

where the multiplication indicated by the dot is element-wise. For the special case $v_x = V^x$, we have

    h_v = h^{(0)} + V^n\, h^{(1)}                                    (22.4-6)


It is not hard to see why this is: Up to the final division by the weight sequence, the weighted convolution
is just the cyclic convolution of the two weighted sequences. For the element $h_\tau$ we have


    h_\tau = \sum_{x+y \equiv \tau \bmod n} (a_x V^x)\,(b_y V^y)
           = \sum_{x \le \tau} a_x\, b_{\tau-x}\, V^{\tau} + \sum_{x > \tau} a_x\, b_{n+\tau-x}\, V^{n+\tau}      (22.4-7)


Final division of this element (by $V^\tau$) gives $h^{(0)} + V^n\, h^{(1)}$ as stated.


+--  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
 0:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
 1:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15   0-
 2:  2   3   4   5   6   7   8   9  10  11  12  13  14  15   0-  1-
 3:  3   4   5   6   7   8   9  10  11  12  13  14  15   0-  1-  2-
 4:  4   5   6   7   8   9  10  11  12  13  14  15   0-  1-  2-  3-
 5:  5   6   7   8   9  10  11  12  13  14  15   0-  1-  2-  3-  4-
 6:  6   7   8   9  10  11  12  13  14  15   0-  1-  2-  3-  4-  5-
 7:  7   8   9  10  11  12  13  14  15   0-  1-  2-  3-  4-  5-  6-
 8:  8   9  10  11  12  13  14  15   0-  1-  2-  3-  4-  5-  6-  7-
 9:  9  10  11  12  13  14  15   0-  1-  2-  3-  4-  5-  6-  7-  8-
10: 10  11  12  13  14  15   0-  1-  2-  3-  4-  5-  6-  7-  8-  9-
11: 11  12  13  14  15   0-  1-  2-  3-  4-  5-  6-  7-  8-  9- 10-
12: 12  13  14  15   0-  1-  2-  3-  4-  5-  6-  7-  8-  9- 10- 11-
13: 13  14  15   0-  1-  2-  3-  4-  5-  6-  7-  8-  9- 10- 11- 12-
14: 14  15   0-  1-  2-  3-  4-  5-  6-  7-  8-  9- 10- 11- 12- 13-
15: 15   0-  1-  2-  3-  4-  5-  6-  7-  8-  9- 10- 11- 12- 13- 14-


Figure 22.4-A: Semi-symbolic table for the negacyclic convolution. The products that enter with 
negative sign are indicated with a postfix minus at the corresponding entry. 


The cases when $V^n$ is some root of unity are particularly interesting. For $V^n = \pm i = \pm\sqrt{-1}$ we obtain
the right-angle convolution:

    h_v = h^{(0)} \pm i\, h^{(1)}                                    (22.4-8)

Choosing $V^n = -1$ leads to the negacyclic convolution (or skew circular convolution):

    h_v = h^{(0)} - h^{(1)}                                          (22.4-9)


Cyclic, negacyclic and right-angle convolution can be understood as polynomial products modulo the
polynomials $z^n - 1$, $z^n + 1$ and $z^n \pm i$, respectively (see [262]).


The semi-symbolic table for the negacyclic convolution is shown in figure 22.4-A. With right-angle
convolution the minuses are replaced by $i = \sqrt{-1}$, so the elements in $h^{(1)}$ go to the imaginary part. With
real input one effectively separates $h^{(0)}$ and $h^{(1)}$. Therefore the linear convolution of real sequences can
be computed using the complex right-angle convolution.


The parts $h^{(0)}$ and $h^{(1)}$ can be computed as sum and difference of the cyclic and the negacyclic convolution.
Thus all expressions of the form $\alpha\,h^{(0)} + \beta\,h^{(1)}$ where $\alpha, \beta \in \mathbb{C}$ can be computed.
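The separation of the two parts can be checked with a small sketch. It uses a direct O(n^2) weighted convolution on std::complex<double> (the helper names and the use of std::vector are assumptions made for this illustration) and recovers the parts as half the sum and half the difference of the cyclic and the negacyclic convolution. With real input, the right-angle convolution (weight i) delivers the same two parts as real and imaginary part.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef std::complex<double> Complex;

// Direct O(n^2) weighted convolution  h = h0 + w*h1  (illustrative sketch).
std::vector<Complex> slow_weighted_convolution(const std::vector<Complex> &f,
                                               const std::vector<Complex> &g, Complex w)
{
    const std::size_t n = f.size();
    std::vector<Complex> h(n);
    for (std::size_t tau = 0; tau < n; ++tau)
    {
        Complex s0 = 0.0, s1 = 0.0;
        for (std::size_t k = 0; k <= tau; ++k)     s0 += f[k] * g[tau - k];      // x + y == tau
        for (std::size_t k = tau + 1; k < n; ++k)  s1 += f[k] * g[n + tau - k];  // x + y == n + tau
        h[tau] = s0 + w * s1;
    }
    return h;
}

// Recover h0 and h1 as half the sum and half the difference of the
// cyclic (w = +1) and the negacyclic (w = -1) convolution.
void split_halves(const std::vector<Complex> &f, const std::vector<Complex> &g,
                  std::vector<Complex> &h0, std::vector<Complex> &h1)
{
    const std::vector<Complex> cyc = slow_weighted_convolution(f, g, Complex(+1.0, 0.0));
    const std::vector<Complex> neg = slow_weighted_convolution(f, g, Complex(-1.0, 0.0));
    h0.resize(f.size());
    h1.resize(f.size());
    for (std::size_t t = 0; t < f.size(); ++t)
    {
        h0[t] = 0.5 * (cyc[t] + neg[t]);
        h1[t] = 0.5 * (cyc[t] - neg[t]);
    }
}
```

For the sequence [1, 1, 1, 1] this reproduces the halves [1, 2, 3, 4] and [3, 2, 1, 0] of the linear self-convolution given earlier in this section.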


The routine for the direct computation has complexity $O(n^2)$ [FXT: convolution/slowweightedcnvl.h]:


template <typename Type>
void slow_weighted_convolution(const Type *f, const Type *g, Type *h, ulong n, Type w)
// weighted (cyclic) convolution:  h[] := f[] (*)_w g[]
// n := array length
{
    for (ulong tau=0; tau<n; ++tau)
    {
        ulong k = 0;
        Type s0 = 0.0;
        for (ulong k2=tau; k<=tau; ++k, --k2)  s0 += (f[k]*g[k2]);
        Type s1 = 0.0;
        for (ulong k2=n-1; k<n; ++k, --k2)  s1 += (f[k]*g[k2]);  // wrapped around
        h[tau] = s0 + s1*w;
    }
}


Transform-based routines for the negacyclic and right-angle convolution are given in [FXT:
tion/weightedconv.cc]:

#define  FFTC(f,ldn,is)  fht_fft(f,ldn,is)

void
weighted_complex_auto_convolution(Complex *f, ulong ldn, double w, double v/*=0.0*/)
// w = weight:
// +0.25 for right angle convolution (-0.25 negates result in fi[])
// +0.5  for negacyclic convolution (also -0.5)
// +1.0  for cyclic convolution (also -1.0)
//
// v!=0.0 chooses alternative normalization
{
    ulong n = (1UL<<ldn);

    fourier_shift(f, n, w);
    FFTC(f, ldn, +1);

    if ( v==0.0 )  v = 1.0/n;
    for (ulong k=0; k<n; k++)
    {
        Complex t = f[k];
        t *= t;
        t *= v;
        f[k] = t;
    }

    FFTC(f, ldn, -1);
    fourier_shift(f, n, -w);
}


22.5 Convolution using the MFA 


We give an algorithm for convolution that uses the matrix Fourier algorithm (MFA, see section 21.10 on
page 438). The MFA is used for the forward transform and the transposed algorithm (TMFA) for the
inverse transform. The elements of each row are assumed to be contiguous in memory. In what follows let
R be the total number of rows and C the length of each row (equivalently, the total number of columns).
For the sake of simplicity only auto-convolution is considered.


22.5.1 The algorithm 
The MFA convolution algorithm: 

1. Apply a (length R) FFT on each column (stride-C memory access).
2. Multiply each matrix element (index r, c) by $\exp(+\sigma\,2\pi i\,r\,c/n)$.
3. Apply a (length C) FFT on each row (stride-1 memory access).
4. Complex square row (element-wise).
5. Apply a (length C) inverse FFT on each row (stride-1 memory access).
6. Multiply each matrix element (index r, c) by $\exp(-\sigma\,2\pi i\,r\,c/n)$.
7. Apply a (length R) inverse FFT on each column (stride-C memory access).
Note that steps 3, 4, and 5 constitute a length-C convolution on each row. 
With the weighted convolutions in mind we reformulate the method as weighted MFA convolution: 
1. Apply an FFT on each column. 
2. For each row r = 0, 1, ..., R-1, apply the weighted convolution with weight $V^C = e^{2\pi i\,r/R} = 1^{r/R}$.
3. Apply an inverse FFT on each column. 


Implementations of this algorithm for the cyclic and linear convolution are given in [FXT:
tion/matrixfftcnvl.cc], the routines for self-convolution in [FXT: convolution/matrixfftcnvla.cc].


We now consider the special cases of two and three rows and then formulate an MFA-based algorithm 
for the convolution of real sequences. 


22.5.2 The case R = 2
Define s and d as the sum and difference of the left and right halves of a given sequence x:


    s := x^{(0/2)} + x^{(1/2)}                                       (22.5-1a)
    d := x^{(0/2)} - x^{(1/2)}                                       (22.5-1b)


Then the cyclic auto-convolution of the sequence x can be computed by two half-length convolutions of
s and d as


    x \circledast x = \frac{1}{2} [\, s \circledast s + d \circledast_{-} d, \;\; s \circledast s - d \circledast_{-} d \,]      (22.5-2)


where the symbols $\circledast$ and $\circledast_{-}$ stand for cyclic and negacyclic convolution, respectively (see section 22.4
on page 448). The equivalent formula for the cyclic convolution of two sequences x and y is


    x \circledast y = \frac{1}{2} [\, s_x \circledast s_y + d_x \circledast_{-} d_y, \;\; s_x \circledast s_y - d_x \circledast_{-} d_y \,]      (22.5-3)

where

    s_x := x^{(0/2)} + x^{(1/2)}                                     (22.5-4a)
    d_x := x^{(0/2)} - x^{(1/2)}                                     (22.5-4b)
    s_y := y^{(0/2)} + y^{(1/2)}                                     (22.5-4c)
    d_y := y^{(0/2)} - y^{(1/2)}                                     (22.5-4d)

Now use the fact that a linear convolution can be computed by a cyclic convolution of zero-padded
sequences whose upper halves are simply zero, so $s_x = d_x = x$ and $s_y = d_y = y$. Then relation 22.5-3
reads:


    x \circledast_{lin} y = \frac{1}{2} [\, x \circledast y + x \circledast_{-} y, \;\; x \circledast y - x \circledast_{-} y \,]      (22.5-5)


And for the acyclic auto-convolution:


    x \circledast_{lin} x = \frac{1}{2} [\, x \circledast x + x \circledast_{-} x, \;\; x \circledast x - x \circledast_{-} x \,]      (22.5-6)


The lower and upper halves of the linear convolution can be computed as the sum and difference of the
cyclic and the negacyclic convolution.
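The case of two rows can be verified numerically with the following sketch. It assumes real input of even length and uses a direct O(n^2) routine for the cyclic and negacyclic half-length convolutions; the function names are chosen for this illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef std::vector<double> Seq;

// Direct O(n^2) weighted convolution h0 + w*h1:
// w = +1 gives the cyclic, w = -1 the negacyclic convolution.
Seq slow_weighted_convolution(const Seq &f, const Seq &g, double w)
{
    const std::size_t n = f.size();
    Seq h(n);
    for (std::size_t tau = 0; tau < n; ++tau)
    {
        double s0 = 0.0, s1 = 0.0;
        for (std::size_t k = 0; k <= tau; ++k)     s0 += f[k] * g[tau - k];
        for (std::size_t k = tau + 1; k < n; ++k)  s1 += f[k] * g[n + tau - k];
        h[tau] = s0 + w * s1;
    }
    return h;
}

// Cyclic auto-convolution of x (even length) from two half-length
// convolutions: a cyclic one of the sums and a negacyclic one of the differences.
Seq cyclic_auto_convolution_r2(const Seq &x)
{
    assert(x.size() % 2 == 0);
    const std::size_t m = x.size() / 2;
    Seq s(m), d(m);
    for (std::size_t i = 0; i < m; ++i)
    {
        s[i] = x[i] + x[m + i];   // sum of left and right half
        d[i] = x[i] - x[m + i];   // difference of left and right half
    }
    const Seq ss = slow_weighted_convolution(s, s, +1.0);  // cyclic
    const Seq dd = slow_weighted_convolution(d, d, -1.0);  // negacyclic
    Seq h(2 * m);
    for (std::size_t i = 0; i < m; ++i)
    {
        h[i]     = 0.5 * (ss[i] + dd[i]);   // left half of the result
        h[m + i] = 0.5 * (ss[i] - dd[i]);   // right half of the result
    }
    return h;
}
```

The result should agree with the full-length cyclic convolution computed directly, which is what the accompanying check compares.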


22.5.3 The case R=3 


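Let $\omega = \frac{1}{2}(-1 + i\sqrt{3})$, a primitive third root of unity, and define


    A := x^{(0/3)} + x^{(1/3)} + x^{(2/3)}                           (22.5-7a)
    B := x^{(0/3)} + \omega\, x^{(1/3)} + \omega^2\, x^{(2/3)}        (22.5-7b)
    C := x^{(0/3)} + \omega^2\, x^{(1/3)} + \omega\, x^{(2/3)}        (22.5-7c)


Let $h := x \circledast x$ and $h^{(0/3)}$, $h^{(1/3)}$, and $h^{(2/3)}$ be the first, second, and last third of h, respectively. We
have


    3\,h^{(0/3)} = A \circledast A + B \circledast_{\{\omega\}} B + C \circledast_{\{\omega^2\}} C                          (22.5-8a)
    3\,h^{(1/3)} = A \circledast A + \omega^2\, B \circledast_{\{\omega\}} B + \omega\, C \circledast_{\{\omega^2\}} C      (22.5-8b)
    3\,h^{(2/3)} = A \circledast A + \omega\, B \circledast_{\{\omega\}} B + \omega^2\, C \circledast_{\{\omega^2\}} C      (22.5-8c)


Here $\circledast_{\{\omega\}}$ denotes the weighted convolution $h^{(0)} + \omega\,h^{(1)}$. The factor 3 on the left is the analogue of the
factor 1/2 in the case R = 2. For real-valued data C is the complex conjugate (cc.) of B and (with $\omega^2$ = cc. of $\omega$)
$B \circledast_{\{\omega\}} B$ is the cc. of $C \circledast_{\{\omega^2\}} C$, and therefore every $B \circledast_{\{\omega\}} B$-term is the cc. of the
$C \circledast_{\{\omega^2\}} C$-term in the same line.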


22.5.4 Convolution of real-valued data 


Consider the MFA-algorithm for the cyclic convolution (as given in section 22.5.1) with real input data:
For row 0, which is real after the column FFTs, the (usual) cyclic convolution is needed. For row R/2,
which is also real after the column FFTs, a negacyclic convolution (see section 25.7.4 on page 528) is
needed (for odd R there is no such row).


All other weighted convolutions involve complex computations, but we can avoid half of the work: as the 
result must be real, row R—r must be the complex conjugate of row r, due to the symmetries of the real 
and imaginary part of the Fourier transform of real data. Therefore we use real FFTs (R2CFTs) for all 
column-transforms in step 1 and half-complex to real FFTs (C2RFTs) in step 3. 


For even R we need one cyclic (row 0), one negacyclic (row R/2), and R/2 - 1 complex weighted
convolutions (rows 1 to R/2 - 1). For odd R we need one cyclic (row 0) and (R - 1)/2 complex weighted convolutions.
Now note that the cyclic and the negacyclic real convolutions involve about the same number of computations and
that the cost of a weighted complex convolution is about twice as high. So the total work is about half
of that for a complex convolution.


For the computation of the linear convolution we can use the right angle convolution (and complex FFTs
in the column passes), see section 22.4 on page 448.

22.5.5 Mass storage convolution 


Algorithms on data sets which do not fit into physical RAM are called external, out of core, or mass
storage algorithms. We give a method for the mass storage convolution.
Let the array be organized as an $R \times C = N$ matrix. Assume that the available workspace (RAM) can
hold $N/\alpha$ array elements (where $\alpha$ divides N). We consider only self-convolution to keep matters simple.


We rewrite the weighted MFA convolution from section 22.5.1 as a mass storage algorithm.
1. FFTs on columns: for k = 0, 1, ..., $\alpha - 1$, do
   1a. read the k-th part of all rows into RAM,
   1b. do the transforms on the columns in RAM,
   1c. and write back to disk.
2. Weighted convolutions on rows: for k = 0, 1, ..., $\alpha - 1$, do
   2a. read the k-th part of all columns (one or more rows) into RAM,
   2b. do the weighted convolutions on the rows in RAM,
   2c. and write back to disk.
3. Inverse FFTs on columns: (as in step 1, but use inverse transforms)


We want to keep the number S of disk seeks minimal because they are slow. The choice of R determines
S: In steps 1a and 1c there are R seeks each, for every value of k; same for steps 3a and 3c; giving $4 R \alpha$
seeks. In steps 2a and 2c there are $\alpha$ seeks each (one per part, because whole rows are contiguous); giving $2\alpha$ seeks.
We need a total of $S = 2\alpha + 4\alpha R$ seeks for the whole computation.


Therefore we choose R as small as possible and not close to $\sqrt{N}$ as done when the array fits into RAM.


The mass storage convolution as described was used for the calculation of the number

    9^{9^9} \approx 0.4281247 \cdot 10^{369,693,100}                 (22.5-9)

on a 32-bit machine in 1999. The computation used two files of size 2 GB each and took less than eight
hours on a system with an AMD K6/2 CPU at 366 MHz with 66 MHz memory. The log-file of the
computation is [hfloat: examples/run1-pow999.txt].


A computation of $\pi$ to 2,700 billion decimal digits on an "inexpensive desktop computer" by Fabrice
Bellard finished December 2009, setting the new world record. A mass storage convolution similar to the
one described here was used. Technical details of this amazing feat are given in [42].


If multi-threading is available, one can use a double buffer technique: split the workspace into halves, one 
(CPU-intensive) thread does the FFTs in one half and another (hard disk intensive) thread reads and 
writes in the other half. This keeps the CPU busy during much of the hard disk operations, avoiding the 
waits during disk activity at least partly. 


22.6 The z-transform (ZT) 


The discrete z-transform (ZT) of a length-n sequence a is a length-n sequence c defined by 


    c = Z[a]                                                         (22.6-1a)

    c_k = \sum_{x=0}^{n-1} a_x\, z^{k x}                              (22.6-1b)


The z-transform is a linear transformation. It is not an orthogonal transformation unless z is a root
of unity. For $z = e^{\pm 2\pi i/n}$ the z-transform specializes to the discrete Fourier transform. An important
property is the convolution property:


    Z[a \circledast b] = Z[a] \cdot Z[b]                              (22.6-2)

Convolution in original space corresponds to element-wise multiplication in z-space. This can be turned
into an efficient convolution algorithm for the special case of the Fourier transform but not in general
because no efficient algorithm for the inverse transform is known.


22.6.1 Computation via convolution (Bluestein’s algorithm) 


Using the identity

    x\,k = \frac{1}{2} \left( x^2 + k^2 - (k - x)^2 \right)           (22.6-3)

we find the following expression for the element $c_k$ of the Fourier transform of the sequence a:

    c_k = \sum_{x=0}^{n-1} a_x\, z^{x k} = z^{k^2/2} \left[ \sum_{x=0}^{n-1} \left( a_x\, z^{x^2/2} \right) z^{-(k-x)^2/2} \right]      (22.6-4)

The expression in brackets is a cyclic convolution of the sequence $a_x\, z^{x^2/2}$ with the sequence $z^{-x^2/2}$.


This leads to the algorithm for the chirp z-transform:

1. Multiply the sequence a element-wise with $z^{x^2/2}$.
2. Convolve the resulting sequence with the sequence $z^{-x^2/2}$.
3. Multiply element-wise with the sequence $z^{k^2/2}$.


The above algorithm constitutes a fast algorithm for the ZT because fast convolution is possible via FFT. 
The idea is due to Bluestein [56], a detailed description is given in [328]. 
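The three steps can be illustrated for the Fourier case $z = e^{\sigma\,2\pi i/n}$ with the sketch below. It is meant to show the identity only: the convolution of step 2 is done by direct summation, so the routine is O(n^2) and not fast. The function names (chirp, chirp_dft, slow_dft) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef std::complex<double> Complex;

// Chirp  z^(m*m/2)  for z = exp(is*2*pi*i/n), that is exp(is*pi*i*m*m/n).
// The exponent m*m is reduced modulo 2*n to keep the argument small.
Complex chirp(unsigned long m, unsigned long n, int is)
{
    const double pi = std::acos(-1.0);
    const unsigned long m2 = (m * m) % (2 * n);
    return std::polar(1.0, is * pi * double(m2) / double(n));
}

// Fourier transform of arbitrary length via the three chirp steps.
// Step 2 (the convolution) is done directly here; a fast version would use FFTs.
std::vector<Complex> chirp_dft(const std::vector<Complex> &a, int is)
{
    const unsigned long n = a.size();
    std::vector<Complex> f(n), c(n);
    for (unsigned long x = 0; x < n; ++x)  f[x] = a[x] * chirp(x, n, is);   // step 1
    for (unsigned long k = 0; k < n; ++k)
    {
        Complex s = 0.0;
        for (unsigned long x = 0; x < n; ++x)                              // step 2
        {
            const unsigned long d = (k >= x ? k - x : x - k);  // (k-x)^2 depends on |k-x| only
            s += f[x] * std::conj(chirp(d, n, is));            // factor z^(-(k-x)^2/2)
        }
        c[k] = s * chirp(k, n, is);                                        // step 3
    }
    return c;
}

// Direct O(n^2) Fourier transform for comparison.
std::vector<Complex> slow_dft(const std::vector<Complex> &a, int is)
{
    const unsigned long n = a.size();
    const double pi = std::acos(-1.0);
    std::vector<Complex> c(n);
    for (unsigned long k = 0; k < n; ++k)
        for (unsigned long x = 0; x < n; ++x)
            c[k] += a[x] * std::polar(1.0, is * 2.0 * pi * double((x * k) % n) / double(n));
    return c;
}
```

Both routines should agree up to rounding errors for any length n, including odd and prime lengths.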


22.6.2 Arbitrary length FFT by ZT 


The length n of the input sequence a for the fast z-transform is not limited to highly composite values. 
For values of n where an FFT is not feasible, pad the sequence with zeros up to a length L with L >= 2n 
such that a length-L FFT can be computed (highly composite L, for example a power of 2). 
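For the power-of-2 case the padded length can be computed as in the following sketch. The helper ld() is written out here as the floor of the base-2 logarithm, which is an assumption about its meaning; padded_length() is a name chosen for this illustration.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Floor of the base-2 logarithm (assumes x > 0).
ulong ld(ulong x)
{
    ulong r = 0;
    while (x >>= 1)  ++r;
    return r;
}

// Smallest power of 2 that is >= 2*n (assumes n >= 1).
ulong padded_length(ulong n)
{
    const ulong ldnn = 1 + ld((n << 1) - 1);
    return 1UL << ldnn;
}
```

For n = 1000 this gives L = 2048, while n = 1025 already needs L = 4096.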


As the Fourier transform is the special case $z = e^{\pm 2\pi i/n}$ of the ZT, the chirp-ZT algorithm constitutes
an FFT algorithm for sequences of arbitrary length. An implementation is [FXT: chirpzt/fftarblen.cc]:


void
fft_arblen(Complex *x, ulong n, int is)
// Arbitrary length FFT.
{
    const ulong ldnn = 1 + ld( (n << 1) - 1 );
    const ulong nn = (1UL<<ldnn);  // smallest power of 2 >= 2*n

    Complex *f = new Complex[nn];
    acopy(x, f, n);
    null(f+n, nn-n);

    Complex *w = new Complex[nn];
    make_fft_chirp(w, n, nn, is);
    multiply(f, n, w);

    double *dw = (double *)w;
    for (ulong k=1; k<2*n; k+=2)  dw[k] = -dw[k];  // == make_fft_chirp(w, n, nn, -is);

    fft_complex_convolution(w, f, ldnn);

    if ( n & 1 )  subtract(f, n, f+n);  // odd n: negacyclic convolution
    else          add(f, n, f+n);       // even n: cyclic convolution

    make_fft_chirp(w, n, nn, is);
    multiply(w, n, f);

    acopy(w, x, n);
    delete [] w;
    delete [] f;
}

The auxiliary routine make_fft_chirp() is


static inline void
make_fft_chirp(Complex *w, ulong n, ulong nn, int is)
// For k=0..n-1:   w[k] := exp( is * k*k * (i*2*PI/n)/2 )  where i = sqrt(-1)
// For k=n..nn-1:  w[k] = 0
{
    double phi = 1.0*is*M_PI/n;  // == (i*2*Pi/n)/2
    ulong k2 = 0,  n2 = 2*n;
    for (ulong k=0; k<n; ++k)
    {
        w[k] = SinCos(phi*k2);
        k2 += (2*k+1);
        if ( k2>n2 )  k2 -= n2;
        // here:  k2 == (k*k) mod 2*n;
    }
    null(w+n, nn-n);
}

The computation of a length-n ZT uses three FFTs with length greater than n. The worst case (if only
FFTs for n a power of 2 are available) is $n = 2^p + 1$: we need three FFTs of length $L = 2^{p+1} \approx 2n$ for
the computation of the convolution. So the total work is about 6 times the work of an FFT of length n.
It is possible to lower this worst case factor to 3 by using highly composite L slightly greater than n.


For multiple computations of z-transforms of the same length one should precompute and store the
transform of the sequence $z^{-x^2/2}$ as it does not change. Therefore the worst case is a factor 2 with highly
composite FFTs and 4 if FFTs are available for powers of 2 only.


22.6.3  Fractional Fourier transform by ZT 


The z-transform with $z = e^{\alpha\,2\pi i/n}$ is called the fractional Fourier transform in [29]. The term is usually
used for the fractional order transform given as relation 25.11-6, see also [274, ch.13].


For $\alpha = \pm 1$ one again obtains the usual Fourier transform. The fractional Fourier transform can be used
for the computation of the Fourier transform of sequences with only few nonzero elements and for the
exact detection of frequencies that are not integer multiples of the lowest frequency of the DFT.


A C++ implementation of the fractional Fourier transform for sequences of arbitrary length is given in
[FXT: chirpzt/fftfract.cc]:


void
fft_fract(Complex *x, ulong n, double v)
// Fractional (fast) Fourier transform.
{
    const ulong ldnn = 1 + ld( (n << 1) - 1 );
    const ulong nn = (1UL<<ldnn);  // smallest power of 2 >= 2*n

    Complex *f = new Complex[nn];
    acopy(x, f, n);
    null(f+n, nn-n);

    Complex *w = new Complex[nn];
    make_fft_fract_chirp(w, v, n, nn);

    for (ulong j=0; j<n; ++j)  f[j] *= w[j];

    for (ulong j=0; j<nn; ++j)  w[j] = conj(w[j]);
    fft_complex_convolution(w, f, ldnn);

    make_fft_fract_chirp(w, v, n, nn);

    for (ulong j=0; j<n; ++j)  w[j] *= f[j];
    acopy(w+n, x, n);

    delete [] w;
    delete [] f;
}

The auxiliary routine make_fft_fract_chirp() is


static inline void
make_fft_fract_chirp(Complex *w, double v, ulong n, ulong nn)
// For k=0..nn-1:   w[k] == exp(v*sqrt(-1)*k*k*2*pi/n/2)
{
    const double phi = v*2.0*M_PI/n/2;
    ulong n2 = 2*n;
    ulong np = 0;
    for (ulong k=0; k<nn; ++k)
    {
        w[k] = SinCos(phi*np);
        np += ((k<<1)+1);  // np == (k*k)%n2
        if ( np>=n2 )  np -= n2;
    }
}

22.7 Prime length FFTs 


For the computation of FFTs for sequences whose length is prime we can exploit the existence of primitive 
roots. We will be able to express the transform of all but the first element as a cyclic convolution of two 
sequences whose length is reduced by one. 


Let p be prime, then an element g exists so that the least positive exponent e so that $g^e \equiv 1 \bmod p$
is $e = p - 1$. The element g is called a generator (or primitive root) modulo p (see section 39.6 on
page 776). Every nonzero element modulo p can be uniquely expressed as a power $g^e$ where $0 \le e < p - 1$.
For example, a generator modulo p = 11 is g = 2, its powers are


    g^0 = 1,  g^1 = 2,  g^2 = 4,  g^3 = 8,  g^4 = 5,  g^5 = 10 \equiv -1,  g^6 = 9,  g^7 = 7,  g^8 = 3,  g^9 = 6,  g^{10} \equiv 1


Likewise, we can express any nonzero element as a negative power of g. Let $h = g^{-1}$, then with our
example h = 6 and


    h^0 = 1,  h^1 = 6,  h^2 = 3,  h^3 = 7,  h^4 = 9,  h^5 = 10 \equiv -1,  h^6 = 5,  h^7 = 8,  h^8 = 4,  h^9 = 2,  h^{10} \equiv 1


This is just the reversed sequence of values. Let C be the Fourier transform of the length-p sequence A:


    C_k = \sum_{x=0}^{p-1} A_x\, W^{\sigma\,x\,k}                     (22.7-1)


where $W = \exp(2\pi i/p)$ and $\sigma = \pm 1$ is the sign of the transform. We split the computation of the Fourier
transform into two parts. We compute the first element of the transform as


    C_0 = \sum_{x=0}^{p-1} A_x                                       (22.7-2)


Now it remains to compute $C_k$ for $1 \le k \le p - 1$:


    C_k = A_0 + \sum_{x=1}^{p-1} A_x\, W^{\sigma\,x\,k}               (22.7-3)


Note the lower index of the sum. We write $k = g^e$ and $x = g^{-f}$ (modulo p), so


    C_{(g^e)} - A_0 = \sum_{f=0}^{p-2} A_{(g^{-f})}\, W^{\sigma\,(g^e)\,(g^{-f})} = \sum_{f=0}^{p-2} A_{(g^{-f})}\, W^{\sigma\,(g^{e-f})}      (22.7-4)


The sum is a cyclic convolution of the sequences $W^*_w := W^{\sigma\,(g^w)}$ and $A^*_w := A_{(g^{-w})}$ where $0 \le w \le p - 2$.


The main algorithm (ignoring the constant terms $A_0$ and $C_0$) can be outlined as follows:
1. Compute A* and W* by permuting the sequences A and W.
2. Compute C* as the cyclic convolution of A* and W*.
3. Compute C by permuting C*.
The method is given in [277], it is called Rader's algorithm. We implement it in GP:


ft_rader(a, is=+1)=
\\ Fourier transform for prime lengths (Rader's algorithm)
{
    local(n, a0, c0, g, w);
    local(c, ixp, ixm, pa, pw, t);
    n = length(a);

    a0 = a[1];  c0 = sum(j=1, n, a[j]);  \\ constant terms
    \\ prepare permutations:
    g = znprimroot(n);  ixp = vector(n, j, lift( g^(j-1) ) );
    g = g^(-1);  ixm = vector(n, j, lift( g^(j-1) ) );

    \\ permute sequence W:
    w = is*2*I*Pi/n;  pw = vector(n-1, j, exp(w*ixp[j]) );

    \\ permute sequence A (negative powers of the generator):
    pa = vector(n-1);  for (j=1, n-1, pa[j]=a[1+ixm[j]] );

    \\ cyclic convolution of permuted sequences:
    t = cconv(pa, pw);  \\ cyclic convolution

    \\ set C_0, and add A_0 to each C_k:
    c = vector(n);  c[1] = c0;  for (k=1, n-1, c[1+k]=t[k]+a0);

    \\ permute to obtain result:
    t = vector(n);  t[1] = c[1];  for (k=2, n, t[1+ixp[k-1]]=c[k]);
    return( t );
}

With a (slow) implementation of the cyclic convolution and DFT we can check whether the method works 
by comparing the results: 


cconv(a, b)=
/* Cyclic convolution (direct computation, n^2 operations) */
/* Example:  cconv([a,b],[c,d]) ==> [b*d + c*a, a*d + c*b] */
{
    local(n, f, s0, s1, k, k2);
    n = length(a);  f = vector(n);
    for (tau=0, n-1,  \\ tau = k + k2
        s0 = 0;  k = 0;  k2 = tau;
        while (k<=tau,  s0 += (a[k+1]*b[k2+1]);  k++;  k2--);
        s1 = 0;  k2 = n-1;  \\ k = tau+1
        while (k<n,  s1 += (a[k+1]*b[k2+1]);  k++;  k2--);
        f[tau+1] = s0 + s1;
    );
    return( f );
}


dft(a, is=+1)=
/* Complex Fourier transform (direct computation, n^2 operations) */
{
    local(n, f, s, ph0, ph);
    n = length(a);  f = vector(n);
    ph0 = is*2*Pi*I/n;
    for (k=0, n-1,
        ph = ph0 * k;
        f[k+1] = sum (x=0, n-1, a[x+1] * exp(ph*x) );
    );
    return( f );
}

To turn the algorithm into a fast Fourier transform we need to compute the convolution via fast transforms
of length (p - 1). This is trivially possible when $p - 1 = 2^q$, for example when p = 5 or p = 17. As
p - 1 is always divisible by 2, we can split at least once. For p = 11 we have (p - 1)/2 = 5 so we can
again use Rader’s algorithm and length-4 transforms. The method can be used to generate code for short 
(prime) length FFTs. One should precompute the permuted and transformed sequence of the powers of 
the primitive root W. Therefore only two FFTs of length (p — 1) will be needed for a length-p transform. 
The algorithm is also an ingredient of the so-called prime factor FFT, see and [218]. 
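For readers who prefer C++ over GP, the method can be sketched as follows. The code assumes that the length is a prime, finds a primitive root by brute force, and computes the cyclic convolution by direct summation, so it illustrates the index permutations rather than giving a fast transform. All names are chosen for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef unsigned long ulong;
typedef std::complex<double> Complex;

// Smallest primitive root modulo the prime p (brute force, for small p only).
ulong primitive_root(ulong p)
{
    for (ulong g = 2; g < p; ++g)
    {
        ulong e = 1, t = g;
        while (t != 1)  { t = (t * g) % p;  ++e; }   // e := multiplicative order of g
        if (e == p - 1)  return g;
    }
    return 1;  // only reached for p == 2
}

// Fourier transform of prime length p = a.size() by Rader's method.
std::vector<Complex> ft_rader(const std::vector<Complex> &a, int is)
{
    const ulong p = a.size();
    const ulong g = primitive_root(p);
    const double pi = std::acos(-1.0);

    std::vector<ulong> gp(p - 1);                    // gp[e] = g^e mod p
    gp[0] = 1;
    for (ulong e = 1; e < p - 1; ++e)  gp[e] = (gp[e - 1] * g) % p;

    std::vector<Complex> pa(p - 1), pw(p - 1);       // permuted sequences A* and W*
    for (ulong e = 0; e < p - 1; ++e)
    {
        pa[e] = a[ gp[(p - 1 - e) % (p - 1)] ];      // A at index g^(-e) == g^(p-1-e)
        pw[e] = std::polar(1.0, is * 2.0 * pi * double(gp[e]) / double(p));
    }

    std::vector<Complex> c(p);
    for (ulong x = 0; x < p; ++x)  c[0] += a[x];     // C_0 is the sum of all elements

    for (ulong e = 0; e < p - 1; ++e)                // cyclic convolution, direct
    {
        Complex s = 0.0;
        for (ulong f = 0; f < p - 1; ++f)
            s += pa[f] * pw[(e + (p - 1) - f) % (p - 1)];
        c[gp[e]] = a[0] + s;                         // un-permute and add A_0
    }
    return c;
}

// Direct O(n^2) Fourier transform for comparison.
std::vector<Complex> slow_dft(const std::vector<Complex> &a, int is)
{
    const ulong n = a.size();
    const double pi = std::acos(-1.0);
    std::vector<Complex> c(n);
    for (ulong k = 0; k < n; ++k)
        for (ulong x = 0; x < n; ++x)
            c[k] += a[x] * std::polar(1.0, is * 2.0 * pi * double((x * k) % n) / double(n));
    return c;
}
```

As with the GP version, comparing the output with a direct transform for a few prime lengths is a simple way to check the permutations.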
The Walsh transform and its relatives


We describe several variants of the Walsh transform, sometimes called Walsh-Hadamard transform or just 
Hadamard transform. The Walsh transform has the same complexity as the Fourier transform but does 
not involve any multiplications. The XOR (dyadic) convolution that can be computed efficiently by the 
Walsh transform is introduced. We also give related transforms like the slant transform, the Reed-Muller 
transform, and the arithmetic transform. 


23.1 Transform with Walsh-Kronecker basis 


 0: [********************************]
 1: [* * * * * * * * * * * * * * * * ]
 2: [**  **  **  **  **  **  **  **  ]
 3: [*  **  **  **  **  **  **  **  *]
 4: [****    ****    ****    ****    ]
 5: [* *  * ** *  * ** *  * ** *  * *]
 6: [**    ****    ****    ****    **]
 7: [*  * ** *  * ** *  * ** *  * ** ]
 8: [********        ********        ]
 9: [* * * *  * * * ** * * *  * * * *]
10: [**  **    **  ****  **    **  **]
11: [*  **  * **  ** *  **  * **  ** ]
12: [****        ********        ****]
13: [* *  * * * ** * * *  * * * ** * ]
14: [**    **  ****  **    **  ****  ]
15: [*  * **  ** *  **  * **  ** *  *]
16: [****************                ]
17: [* * * * * * * *  * * * * * * * *]
18: [**  **  **  **    **  **  **  **]
19: [*  **  **  **  * **  **  **  ** ]
20: [****    ****        ****    ****]
21: [* *  * ** *  * * * ** *  * ** * ]
22: [**    ****    **  ****    ****  ]
23: [*  * ** *  * **  ** *  * ** *  *]
24: [********                ********]
25: [* * * *  * * * * * * * ** * * * ]
26: [**  **    **  **  **  ****  **  ]
27: [*  **  * **  **  **  ** *  **  *]
28: [****        ****    ********    ]
29: [* *  * * * ** *  * ** * * *  * *]
30: [**    **  ****    ****  **    **]
31: [*  * **  ** *  * ** *  **  * ** ]


Figure 23.1-A: Basis functions for the Walsh transform (Walsh-Kronecker basis). Asterisks denote the
value +1, blank entries denote -1. One character per entry is used here.


A Walsh transform routine can be obtained by removing all multiplications (with sines and cosines) in an
FFT routine. We do so with a radix-2 decimation in time FFT:


void slow_walsh_wak_dit2(double *f, ulong ldn)
// (this routine has a problem)
{
    ulong n = (1UL<<ldn);
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong j=0; j<mh; ++j)
        {
            for (ulong r=0; r<n; r+=m)
            {
                const ulong t1 = r+j;
                const ulong t2 = t1+mh;
                double u = f[t1];
                double v = f[t2];
                f[t1] = u+v;
                f[t2] = u-v;
            }
        }
    }
}


The transform involves $O(n \log_2(n))$ additions (and subtractions) and no multiplication at all. The
transform, as given, is its own inverse up to a factor 1/n. The Walsh transform of integer input is 
integral. 
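Both statements are easy to check with a small self-contained sketch. It uses std::vector and a template parameter for the element type (choices made for this illustration, not the FXT interface); the order of the two inner loops does not affect the result.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Unnormalized Walsh transform (Walsh-Kronecker basis), in place.
// The length of f is assumed to be a power of 2.
template <typename Type>
void walsh_wak(std::vector<Type> &f)
{
    const ulong n = f.size();
    for (ulong mh = 1; mh < n; mh <<= 1)          // half length of the butterflies
    {
        for (ulong r = 0; r < n; r += 2 * mh)
        {
            for (ulong j = 0; j < mh; ++j)
            {
                const Type u = f[r + j];
                const Type v = f[r + j + mh];
                f[r + j]      = u + v;
                f[r + j + mh] = u - v;
            }
        }
    }
}
```

Applying walsh_wak() twice to a length-n integer sequence multiplies every element by n, and all intermediate values stay integral.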


As the slow in the name of the routine suggests, the implementation has a problem as given. The memory
access pattern is highly non-local. Let's make a slight improvement. We took the radix-2 DIT FFT code
(section 1.2.13) and threw away all trigonometric computations (and multiplications).
But the swapping of the inner loops, which we did for the FFT to save trigonometric computations, is
now of no advantage anymore. So we try the following routine [FXT: walsh/walshwak2.h]:


template <typename Type>
void walsh_wak_dit2(Type *f, ulong ldn)
// Transform wrt. to Walsh-Kronecker basis (wak-functions).
// Radix-2 decimation in time (DIT) algorithm.
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong r=0; r<n; r+=m)
        {
            ulong t1 = r;
            ulong t2 = r+mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}


The impact on performance is quite dramatic. For $n = 2^{21}$ (and type double, 16 MB of RAM) it gives a
speedup by a factor of about 8. For smaller lengths the ratio approaches one. 


The data flow diagram (butterfly diagram) for the radix-2 decimation in time (DIT) algorithm is shown in
figure 23.1-B. The figure was created with a program from the FXT library. The diagram
for the decimation in frequency (DIF) algorithm is obtained by reversing the order of the steps. In the
code, only the outermost loop has to be changed:


template <typename Type>
void walsh_wak_dif2(Type *f, ulong ldn)
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        [--snip--]  // same block as in DIT routine
    }
}


Figure 23.1-B: Data flow for the length-16, radix-2, decimation in time (DIT) transform. The stages
are from bottom to top. Thin lines indicate a factor of -1.


The basis functions are shown in figure 23.1-A. The lowest row is (the signed version of) the Thue-Morse
sequence, see section 1.16.4 on page 44. A routine that computes the k-th basis function of the transform
is [FXT: walsh/walsh-basis.h]:

template <typename Type>
void walsh_wak_basis(Type *f, ulong n, ulong k)
{
    for (ulong i=0; i<n; ++i)
    {
        ulong x = i & k;
        x = parity(x);
        f[i] = ( 0==x ? +1 : -1 );
    }
}


Multi-dimensional Walsh transform


If the row-column algorithm is used (see section 21.9.1 on page 437) to compute a 2-dimensional $n \times m$
Walsh transform, then the result is exactly the same as with a 1-dimensional transform of length $n \cdot m$.
That is, a k-dimensional $n_1 \times n_2 \times \ldots \times n_k$ transform is identical to a 1-dimensional transform of length
$n_1 \cdot n_2 \cdot \ldots \cdot n_k$. The length-$2^n$ Walsh transform is identical to an n-dimensional length-2 Fourier transform.


23.2  Eigenvectors of the Walsh transform ‡


The Walsh transforms are self-inverse, so their eigenvalues can only be +1 and -1. Let a be a sequence and let
W(a) denote the Walsh transform of a. Set

    u_{+} := W(a) + a                                                (23.2-1)

 0: [ +5 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 ]
 1: [ +1 +3 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 ]
 2: [ +1 +1 +3 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 ]
 3: [ +1 -1 -1 +5 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 ]
 4: [ +1 +1 +1 +1 +3 -1 -1 -1 +1 +1 +1 +1 -1 -1 -1 -1 ]
 5: [ +1 -1 +1 -1 -1 +5 -1 +1 +1 -1 +1 -1 -1 +1 -1 +1 ]
 6: [ +1 +1 -1 -1 -1 -1 +5 +1 +1 +1 -1 -1 -1 -1 +1 +1 ]
 7: [ +1 -1 -1 +1 -1 +1 +1 +3 +1 -1 -1 +1 -1 +1 +1 -1 ]
 8: [ +1 +1 +1 +1 +1 +1 +1 +1 -5 -1 -1 -1 -1 -1 -1 -1 ]
 9: [ +1 -1 +1 -1 +1 -1 +1 -1 -1 -3 -1 +1 -1 +1 -1 +1 ]
10: [ +1 +1 -1 -1 +1 +1 -1 -1 -1 -1 -3 +1 -1 -1 +1 +1 ]
11: [ +1 -1 -1 +1 +1 -1 -1 +1 -1 +1 +1 -5 -1 +1 +1 -1 ]
12: [ +1 +1 +1 +1 -1 -1 -1 -1 -1 -1 -1 -1 -3 +1 +1 +1 ]
13: [ +1 -1 +1 -1 -1 +1 -1 +1 -1 +1 -1 +1 +1 -5 +1 -1 ]
14: [ +1 +1 -1 -1 -1 -1 +1 +1 -1 -1 +1 +1 +1 +1 -5 -1 ]
15: [ +1 -1 -1 +1 -1 +1 +1 -1 -1 +1 +1 -1 +1 -1 -1 -3 ]


Figure 23.2-A: Eigenvectors of the length-16 Walsh transform (Walsh-Kronecker basis) as row vectors.
The eigenvalues are +1 for the vectors 0...7 and -1 for the vectors 8...15. Linear combinations of
vectors with the same eigenvalue e are again eigenvectors with eigenvalue e.


Then

    W(u_{+}) = W(W(a)) + W(a) = a + W(a) = +1 \cdot u_{+}             (23.2-2)


That is, $u_{+}$ is an eigenvector of W with eigenvalue +1. Equivalently, $u_{-} := W(a) - a$ is an eigenvector
with eigenvalue -1. Thus two eigenvectors can be computed for an arbitrary nonzero sequence. Note
that with the unnormalized transforms the eigenvalues are $\pm\sqrt{n}$.


We are interested in a simple routine that for a Walsh transform of length n gives a set of n eigenvectors 
that span the n-dimensional space. With a routine that computes the k-th basis function of the transform 
we can obtain an eigenvector by simply adding a delta peak at position k to the basis function. The delta 
peak has to be scaled according to whether a positive or negative eigenvalue is desired and according to 
the normalization of the transform. 


A suitable routine for the Walsh-Kronecker basis (whose basis functions are given in figure 23.1-A on
page 459) is


void
walsh_wak_eigen(double *v, ulong ldn, ulong k)
// Eigenvectors of the Walsh transform (walsh_wak).
// Eigenvalues are +1 if k<n/2, else -1
{
    ulong n = 1UL << ldn;
    walsh_wak_basis(v, n, k);
    double d = sqrt(n);
    v[k] += (k<n/2 ? +d : -d);
}


This routine is given in [FXT: walsh/walsheigen.cc]. Figure 23.2-A was created with the program [FXT:
fft/walsh-eigenvec-demo.cc].
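The eigenvector property can be checked numerically with the sketch below. It is self-contained, so the transform and the basis function are re-implemented with std::vector (an assumption of this illustration, the interfaces differ from the FXT routines). Because the transform used here is unnormalized, the eigenvalues come out as plus or minus the square root of n.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef unsigned long ulong;

// Unnormalized Walsh transform (Walsh-Kronecker basis), length a power of 2.
void walsh_wak(std::vector<double> &f)
{
    const ulong n = f.size();
    for (ulong mh = 1; mh < n; mh <<= 1)
        for (ulong r = 0; r < n; r += 2 * mh)
            for (ulong j = 0; j < mh; ++j)
            {
                const double u = f[r + j], v = f[r + j + mh];
                f[r + j] = u + v;
                f[r + j + mh] = u - v;
            }
}

// k-th basis function: +1 where (i AND k) has even parity, else -1.
std::vector<double> walsh_wak_basis(ulong n, ulong k)
{
    std::vector<double> f(n);
    for (ulong i = 0; i < n; ++i)
    {
        ulong x = i & k, par = 0;
        while (x)  { par ^= (x & 1);  x >>= 1; }
        f[i] = (par == 0 ? +1.0 : -1.0);
    }
    return f;
}

// Eigenvector: basis function plus a scaled delta peak at position k.
std::vector<double> walsh_wak_eigen(ulong n, ulong k)
{
    std::vector<double> v = walsh_wak_basis(n, k);
    const double d = std::sqrt(double(n));
    v[k] += (k < n / 2 ? +d : -d);
    return v;
}
```

For n = 16 the vectors produced by walsh_wak_eigen() are the rows of the length-16 eigenvector table, and transforming row k scales it by +4 for k < 8 and by -4 otherwise.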


23.3 The Kronecker product 


The length-2 Walsh transform is equivalent to the multiplication of a 2-component vector by the matrix 


+1 +1 | 


lr ocu (23.3-1) 
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The length-4 Walsh transform corresponds to 


Wa || Sear hae Sisto (23.3-2) 


MC EU 


One might be tempted to write 


W, = 


| Wo We | (23.3-3) 


+W: -W2 


This idea can indeed be turned into a well-defined notation which is quite powerful when dealing with 
orthogonal transforms and their fast algorithms. Let A be an m x n matrix 


Q0,0 Q0,1 EU Q0,n—1 
01,0 01,1 un Q1,n—1 
A = i , (23.3-4) 
Am-—1,0 G@m—-1,1 ***  Am-—ijm-1 


The (right) Kronecker product (or tensor product) with a matrix B is 


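\[ A \otimes B := \begin{bmatrix} a_{0,0}B & a_{0,1}B & \cdots & a_{0,n-1}B \\ a_{1,0}B & a_{1,1}B & \cdots & a_{1,n-1}B \\ \vdots & \vdots & & \vdots \\ a_{m-1,0}B & a_{m-1,1}B & \cdots & a_{m-1,n-1}B \end{bmatrix} \tag{23.3-5} \]

There is no restriction on the dimensions of B. If B is an r x s matrix, then the dimensions of the given
Kronecker product are (mr) x (ns). The entries of the matrix $C = A \otimes B$ are
$c_{k+ir,\,l+js} = a_{i,j}\, b_{k,l}$. The Kronecker product is not commutative, that is,
$A \otimes B \neq B \otimes A$ in general.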
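The index formula translates directly into four nested loops. The sketch below is a minimal illustration with a vector-of-vectors matrix type; the names `Mat`, `kron` and `walsh2` are hypothetical and not taken from the FXT library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<long> > Mat;  // Mat[row][column] (hypothetical helper type)

// Right Kronecker product of an m x n matrix A with an r x s matrix B:
// C[k+i*r][l+j*s] = A[i][j] * B[k][l], so C is (m*r) x (n*s).
inline Mat kron(const Mat &A, const Mat &B)
{
    const std::size_t m = A.size(),  n = A[0].size();
    const std::size_t r = B.size(),  s = B[0].size();
    Mat C(m*r, std::vector<long>(n*s, 0));
    for (std::size_t i=0; i<m; ++i)
        for (std::size_t j=0; j<n; ++j)
            for (std::size_t k=0; k<r; ++k)
                for (std::size_t l=0; l<s; ++l)
                    C[k+i*r][l+j*s] = A[i][j] * B[k][l];
    return C;
}

// The matrix W_2 of the length-2 Walsh transform.
inline Mat walsh2()
{
    Mat W(2, std::vector<long>(2, +1));
    W[1][1] = -1;
    return W;
}
```

With these helpers, `kron(walsh2(), walsh2())` gives the 4 x 4 matrix of the length-4 transform, and swapping the arguments of `kron` for a 1 x 2 matrix and W_2 gives two different 2 x 4 matrices.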


For a scalar factor $\alpha$ the following relations are immediate:

\[ (\alpha A) \otimes B = \alpha\, (A \otimes B) \tag{23.3-6a} \]
\[ A \otimes (\alpha B) = \alpha\, (A \otimes B) \tag{23.3-6b} \]


The next relations are the same as for the ordinary matrix product. Distributivity (the matrices on both 
sides of a plus sign must be of the same dimensions): 


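\[ (A + B) \otimes C = A \otimes C + B \otimes C \tag{23.3-7a} \]
\[ A \otimes (B + C) = A \otimes B + A \otimes C \tag{23.3-7b} \]

Associativity:

\[ A \otimes (B \otimes C) = (A \otimes B) \otimes C \tag{23.3-8} \]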


The matrix product (indicated by a dot) of Kronecker products can be rewritten as 


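\[ (A \otimes B) \cdot (C \otimes D) = (A \cdot C) \otimes (B \cdot D) \tag{23.3-9a} \]
\[ (L_1 \otimes R_1) \cdot (L_2 \otimes R_2) \cdots (L_n \otimes R_n) = (L_1 \cdot L_2 \cdots L_n) \otimes (R_1 \cdot R_2 \cdots R_n) \tag{23.3-9b} \]

Set $L_1 = L_2 = \ldots = L_n =: L$ and $R_1 = R_2 = \ldots = R_n =: R$ in the latter relation to obtain

\[ (L \otimes R)^n = L^n \otimes R^n \tag{23.3-9c} \]

The Kronecker product of matrix products can be rewritten as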


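\[ (A \cdot B) \otimes (C \cdot D) = (A \otimes C) \cdot (B \otimes D) \tag{23.3-10a} \]
\[ (L_1 \cdot R_1) \otimes (L_2 \cdot R_2) \otimes \cdots \otimes (L_n \cdot R_n) = (L_1 \otimes L_2 \otimes \cdots \otimes L_n) \cdot (R_1 \otimes R_2 \otimes \cdots \otimes R_n) \tag{23.3-10b} \]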


Here the matrices left and right from a dot must be compatible for ordinary matrix multiplication. 
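The rule for the matrix product of two Kronecker products is easy to test on small integer matrices. The following sketch is an illustration only; the helper names `make_mat`, `mul`, `kron` and `mixed_product_holds` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<long> > Mat;  // Mat[row][column] (hypothetical helper type)

// Build an r x c matrix from r*c values given in row-major order.
inline Mat make_mat(std::size_t r, std::size_t c, const long *v)
{
    Mat M(r, std::vector<long>(c));
    for (std::size_t i=0; i<r; ++i)
        for (std::size_t j=0; j<c; ++j)  M[i][j] = v[i*c + j];
    return M;
}

// Ordinary matrix product; A must have as many columns as B has rows.
inline Mat mul(const Mat &A, const Mat &B)
{
    assert( A[0].size() == B.size() );
    Mat R(A.size(), std::vector<long>(B[0].size(), 0));
    for (std::size_t i=0; i<A.size(); ++i)
        for (std::size_t k=0; k<B.size(); ++k)
            for (std::size_t j=0; j<B[0].size(); ++j)
                R[i][j] += A[i][k] * B[k][j];
    return R;
}

// Right Kronecker product: C[k+i*r][l+j*s] = A[i][j] * B[k][l] for an r x s matrix B.
inline Mat kron(const Mat &A, const Mat &B)
{
    const std::size_t r = B.size(),  s = B[0].size();
    Mat C(A.size()*r, std::vector<long>(A[0].size()*s, 0));
    for (std::size_t i=0; i<A.size(); ++i)
        for (std::size_t j=0; j<A[0].size(); ++j)
            for (std::size_t k=0; k<r; ++k)
                for (std::size_t l=0; l<s; ++l)
                    C[k+i*r][l+j*s] = A[i][j] * B[k][l];
    return C;
}

// True if (A (x) B).(C (x) D) equals (A.C) (x) (B.D) for the given matrices.
inline bool mixed_product_holds(const Mat &A, const Mat &B, const Mat &C, const Mat &D)
{
    return mul(kron(A, B), kron(C, D)) == kron(mul(A, C), mul(B, D));
}
```

The matrices need not be square: it suffices that A·C and B·D are defined, for example A of size 2 x 3, C of size 3 x 2, B of size 1 x 2 and D of size 2 x 2.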


We have 


\[ (A \otimes B)^T = A^T \otimes B^T \tag{23.3-11a} \]
\[ (A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \tag{23.3-11b} \]


If A and B are respectively m x n and r x s matrices, then 


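\[ A \otimes B = (I_m \otimes B) \cdot (A \otimes I_s) \tag{23.3-12a} \]
\[ \phantom{A \otimes B} = (A \otimes I_r) \cdot (I_n \otimes B) \tag{23.3-12b} \]

where $I_n$ is the n x n identity matrix. If A is n x n and B is t x t, then

\[ \det(A \otimes B) = \det(A)^t\, \det(B)^n \tag{23.3-13} \]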


Back to the Walsh transform, we have $W_1 = [1]$ and for $n = 2^k$, $n > 1$:

\[ W_n = \begin{bmatrix} +W_{n/2} & +W_{n/2} \\ +W_{n/2} & -W_{n/2} \end{bmatrix} = W_2 \otimes W_{n/2} \tag{23.3-14} \]


To see that this relation is the statement of a fast algorithm, split the (to be transformed) vector x into 
halves 


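\[ x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} \tag{23.3-15} \]

and write out the matrix-vector product

\[ W_n\, x = \begin{bmatrix} W_{n/2}\, x_0 + W_{n/2}\, x_1 \\ W_{n/2}\, x_0 - W_{n/2}\, x_1 \end{bmatrix} = \begin{bmatrix} W_{n/2}\, (x_0 + x_1) \\ W_{n/2}\, (x_0 - x_1) \end{bmatrix} \tag{23.3-16} \]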
That is, a length-n transform can be computed by two length-n/2 transforms of the sum and difference 
of the first and second half of x. 
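Read as a program, the relation gives a recursive transform. The sketch below is a minimal illustration of this reading and not the FXT implementation; the names `walsh_rec` and `walsh_by_matrix` are hypothetical, and the O(n^2) matrix product serves only as a reference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return 1 if the number of set bits of x is odd, else 0.
inline unsigned long parity(unsigned long x)
{
    unsigned long p = 0;
    for ( ; x; x >>= 1)  p ^= (x & 1);
    return p;
}

// Recursive Walsh (wak) transform following W_n = W_2 (x) W_{n/2}:
// replace the halves by their sum and difference, then transform both halves.
// n must be a power of two.
inline void walsh_rec(long *x, std::size_t n)
{
    if ( n <= 1 )  return;  // W_1 = [1]
    const std::size_t nh = n / 2;
    for (std::size_t i=0; i<nh; ++i)
    {
        const long u = x[i],  v = x[i+nh];
        x[i]    = u + v;  // entry of x0 + x1
        x[i+nh] = u - v;  // entry of x0 - x1
    }
    walsh_rec(x, nh);
    walsh_rec(x + nh, nh);
}

// Reference: product with the matrix whose entries are (-1)^parity(i & k).
inline std::vector<long> walsh_by_matrix(const std::vector<long> &x)
{
    const std::size_t n = x.size();
    std::vector<long> r(n, 0);
    for (std::size_t k=0; k<n; ++k)
        for (std::size_t i=0; i<n; ++i)
            r[k] += ( parity(i & k) ? -x[i] : +x[i] );
    return r;
}
```

Each level of the recursion costs n/2 sum-difference operations and there are log_2(n) levels, compared with n^2 operations for the matrix product.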

We define a notation equivalent to the product sign, 


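\[ \bigotimes_{k=1}^{n} M_k := M_1 \otimes M_2 \otimes M_3 \otimes \cdots \otimes M_n \tag{23.3-17} \]

where the empty product equals a 1 x 1 matrix with entry 1. If A = B in relation 23.3-11b then we have
$(A \otimes A)^{-1} = A^{-1} \otimes A^{-1}$, $(A \otimes A \otimes A)^{-1} = A^{-1} \otimes A^{-1} \otimes A^{-1}$ and so on. That is,

\[ \left( \bigotimes_{k=1}^{n} A \right)^{-1} = \bigotimes_{k=1}^{n} A^{-1} \tag{23.3-18} \]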


For the Walsh transform we have 


\[ W_n = \bigotimes_{k=1}^{\log_2(n)} W_2 \tag{23.3-19} \]

and

\[ W_n^{-1} = \bigotimes_{k=1}^{\log_2(n)} W_2^{-1} \tag{23.3-20} \]

The latter relation isn't that exciting as $W_2^{-1} = W_2$ for the Walsh transform. However, it also holds if
the inverse transform is different from the forward transform. Given a fast algorithm for some transform
in the form of a Kronecker product, the fast algorithm for the inverse transform is immediate.


The direct sum of two matrices is defined as 


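\[ A \oplus B := \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} \tag{23.3-21} \]

In general $A \oplus B \neq B \oplus A$. As an analogue to the sum sign we have

\[ \bigoplus_{k=1}^{n} A := I_n \otimes A \tag{23.3-22} \]

where $I_n$ is the n x n identity matrix. The matrix $I_n \otimes A$ consists of n copies of A that lie on the diagonal.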
The Kronecker product can be used to derive properties of unitary transforms, see [282]. In [236] the 
properties of the Kronecker product are used to develop all well-known algorithms for computing the 
Fourier transform. 


23.4 Higher radix Walsh transforms 


23.4.1 Generated transforms 


A generator for short-length Walsh (wak) transforms is given as [FXT: fft/gen-walsh-demo.cc]. It can
create code for DIF and DIT transforms. For example, the code for the 4-point DIF transform is

    template <typename Type>
    inline void
    short_walsh_wak_dif_4(Type *f)
    {
        Type t0, t1, t2, t3;
        t0 = f[0];
        t1 = f[1];
        t2 = f[2];
        t3 = f[3];
        sumdiff( t0, t2 );
        sumdiff( t1, t3 );
        sumdiff( t0, t1 );
        sumdiff( t2, t3 );
        f[0] = t0;
        f[1] = t1;
        f[2] = t2;
        f[3] = t3;
    }


To make the code more readable we use the function [FXT: aux0/sumdiff.h]:

    template <typename Type>
    static inline void sumdiff(Type &a, Type &b)
    // {a, b} <--| {a+b, a-b}
    { Type t=a-b; a+=b; b=t; }


We further need a variant that transforms elements which are not contiguous but lie apart by a distance s: 


    template <typename Type>
    inline void
    short_walsh_wak_dif_4(Type *f, ulong s)
    {
        Type t0, t1, t2, t3;
        {
            ulong x = 0;
            t0 = f[x];  x += s;
            t1 = f[x];  x += s;
            t2 = f[x];  x += s;
            t3 = f[x];
        }
        sumdiff( t0, t2 );
        sumdiff( t1, t3 );
        sumdiff( t0, t1 );
        sumdiff( t2, t3 );
        {
            ulong x = 0;
            f[x] = t0;  x += s;
            f[x] = t1;  x += s;
            f[x] = t2;  x += s;
            f[x] = t3;
        }
    }


The short-length transforms (DIF and DIT variants) are given in [FXT: walsh/shortwalshwakdif.h] and
[FXT: walsh/shortwalshwakdit.h], respectively. A radix-4 DIF transform using these as ingredients is
[FXT: walsh/walshwak4.h]:

    template <typename Type>
    void walsh_wak_dif4(Type *f, ulong ldn)
    // Transform wrt. to Walsh-Kronecker basis (wak-functions).
    // Radix-4 decimation in frequency (DIF) algorithm.
    // Self-inverse.
    {
        const ulong n = (1UL<<ldn);

        if ( n<=2 )
        {
            if ( n==2 )  short_walsh_wak_dif_2(f);
            return;
        }

        for (ulong ldm=ldn; ldm>3; ldm-=2)
        {
            ulong m = (1UL<<ldm);
            ulong m4 = (m>>2);
            for (ulong r=0; r<n; r+=m)
            {
                for (ulong j=0; j<m4; j++)  short_walsh_wak_dif_4(f+j+r, m4);
            }
        }

        if ( ldn & 1 )  // n is not a power of 4, need a radix-8 step
        {
            for (ulong i0=0; i0<n; i0+=8)  short_walsh_wak_dif_8(f+i0);
        }
        else
        {
            for (ulong i0=0; i0<n; i0+=4)  short_walsh_wak_dif_4(f+i0);
        }
    }


With the implementation of the radix-8 DIF transform some care must be taken to choose the correct
final step size [FXT: walsh/walshwak8.h]:

    template <typename Type>
    void walsh_wak_dif8(Type *f, ulong ldn)
    // Transform wrt. to Walsh-Kronecker basis (wak-functions).
    // Radix-8 decimation in frequency (DIF) algorithm.
    // Self-inverse.
    {
        const ulong n = (1UL<<ldn);

        if ( n<=4 )
        {
            switch (n)
            {
            case 4:  short_walsh_wak_dif_4(f);  break;
            case 2:  short_walsh_wak_dif_2(f);  break;
            }
            return;
        }

        const ulong xx = 4;
        ulong ldm;
        for (ldm=ldn; ldm>xx; ldm-=3)
        {
            ulong m = (1UL<<ldm);
            ulong m8 = (m>>3);
            for (ulong r=0; r<n; r+=m)
            {
                for (ulong j=0; j<m8; j++)  short_walsh_wak_dif_8(f+j+r, m8);
            }
        }

        switch ( ldm )
        {
        case 4:
            for (ulong i0=0; i0<n; i0+=16)  short_walsh_wak_dif_16(f+i0);
            break;
        case 3:
            for (ulong i0=0; i0<n; i0+=8)  short_walsh_wak_dif_8(f+i0);
            break;
        case 2:
            for (ulong i0=0; i0<n; i0+=4)  short_walsh_wak_dif_4(f+i0);
            break;
        }
    }


23.4.2 Performance 


For the purpose of performance comparison we include a matrix variant of the Walsh transform [FXT:
walsh/walshwakmatrix.h]:

    template <typename Type>
    void walsh_wak_matrix(Type *f, ulong ldn)
    {
        ulong ldc = (ldn>>1);
        ulong ldr = ldn-ldc;  // ldr>=ldc
        ulong nc = (1UL<<ldc);
        ulong nr = (1UL<<ldr);  // nrow >= ncol
        for (ulong r=0; r<nr; ++r)  walsh_wak_dif4(f+r*nc, ldc);
        transpose2(f, nr, nc);
        for (ulong c=0; c<nc; ++c)  walsh_wak_dif4(f+c*nr, ldr);
        transpose2(f, nc, nr);
    }

The transposition routine is given in [FXT: aux2/transpose2.h]. We only use even powers of 2 so the
transposition is that of a square matrix.
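The row-column idea can be tried out in isolation. The following sketch handles only even ldn (square matrix) and uses a plain radix-2 transform for the rows instead of the radix-4 routine; all names (`walsh_radix2`, `transpose_square`, `walsh_matrix_even`, `matrix_matches_reference`) are hypothetical and the code is an illustration, not the FXT implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef unsigned long ulong;

// Plain radix-2 Walsh (wak) transform of length 2**ldn.
inline void walsh_radix2(long *f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong mh=n>>1; mh>=1; mh>>=1)
        for (ulong r=0; r<n; r+=2*mh)
            for (ulong j=0; j<mh; ++j)
            {
                const long u = f[r+j],  v = f[r+j+mh];
                f[r+j] = u + v;
                f[r+j+mh] = u - v;
            }
}

// In-place transposition of a square nr x nr matrix stored row by row.
inline void transpose_square(long *f, ulong nr)
{
    for (ulong r=0; r<nr; ++r)
        for (ulong c=r+1; c<nr; ++c)
            std::swap(f[r*nr+c], f[c*nr+r]);
}

// Row-column Walsh transform for even ldn: transform all rows,
// transpose, transform all rows again, transpose back.
inline void walsh_matrix_even(long *f, ulong ldn)
{
    assert( (ldn & 1) == 0 );  // this sketch needs a square matrix
    const ulong ldr = ldn >> 1;
    const ulong nr = 1UL << ldr;
    for (ulong r=0; r<nr; ++r)  walsh_radix2(f + r*nr, ldr);
    transpose_square(f, nr);
    for (ulong r=0; r<nr; ++r)  walsh_radix2(f + r*nr, ldr);
    transpose_square(f, nr);
}

// Compare the row-column version with the plain transform on fixed data.
inline bool matrix_matches_reference(ulong ldn)
{
    const ulong n = 1UL << ldn;
    std::vector<long> a(n), b(n);
    for (ulong i=0; i<n; ++i)  a[i] = b[i] = (long)((i*i + 3*i + 1) % 17);
    walsh_radix2(&a[0], ldn);
    walsh_matrix_even(&b[0], ldn);
    return a == b;
}
```

The two passes work because the transform acts independently on the row bits and the column bits of the index; each pass touches only short contiguous rows.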


As for dyadic convolutions we do not need the data in a particular order so we also include a version of 
the matrix algorithm that omits the final transposition: 


    template <typename Type>
    void walsh_wak_matrix_1(Type *f, ulong ldn, int is)
    {
        ulong ldc = (ldn>>1);
        ulong ldr = ldn-ldc;  // ldr>=ldc
        if ( is<0 )  swap2(ldr, ldc);  // inverse
        ulong nc = (1UL<<ldc);
        ulong nr = (1UL<<ldr);  // nrow >= ncol
        for (ulong r=0; r<nr; ++r)  walsh_wak_dif4(f+r*nc, ldc);
        transpose2(f, nr, nc);
        for (ulong c=0; c<nc; ++c)  walsh_wak_dif4(f+c*nr, ldr);
    }

The following calls give (up to normalization) the mutually inverse transforms:

    walsh_wak_matrix_1(f, ldn, +1);
    walsh_wak_matrix_1(f, ldn, -1);

We do not consider the range of transform lengths n < 128, where unrolled routines and the radix-4
algorithm consistently win. Figure 23.4-A shows a comparison of the routines given so far. There are
clearly two regions to distinguish: firstly, the region where the transforms fit into the first-level data
cache (which is 64 kilobyte, corresponding to ldn = 13). Secondly, the region where ldn > 13 and the
performance becomes more and more memory bound.


In the first region the radix-4 routine is the fastest. The radix-8 routine comes close but, somewhat 
surprisingly, never wins. 


In the second region the matrix version is the best. However, for very large sizes its performance could
be better. Note that with odd ldn (not shown) its performance drops significantly due to the more
expensive transposition operation. The transposition is clearly the bottleneck. One can use
machine-specific optimizations for the transposition to further improve the performance.

In the next section we give an algorithm that avoids the transposition completely and consistently
outperforms the matrix algorithm.


23.5 Localized Walsh transforms 


A decimation in time (DIT) algorithm combines the halves of the array, then the halves of the halves, 
the halves of each quarter, and so on. With each step the whole array is accessed which leads to a drop 
in performance as soon as the array does not fit into the cache. 


23.5.1 The method of localization 


We can reorganize the algorithm as follows: combine the halves of the array and postpone further 
processing of the upper half, then combine the halves of the lower half and again postpone processing of 
its upper half. Repeat until size 2 is reached. Then use the algorithm at the postponed parts, starting 
with the smallest (last postponed). 


For size 16 the scheme can be sketched as follows: 


    hhhhhhhhhhhhhhhh
    hhhhhhhh44444444
    hhhh333344444444
    hh22333344444444

The letters 'h' denote places processed before any recursive call. The blocks of twos, threes and fours
denote postponed blocks. The Walsh transform is thereby decomposed into a sequence of Haar transforms
(see figure 24.6-A on page 508). The algorithm described is most easily implemented via recursion:

    template <typename Type>
    void walsh_wak_loc_dit2(Type *f, ulong ldn)
    {
        if ( ldn<1 )  return;

        // Recursion:
        for (ulong ldm=1; ldm<ldn; ++ldm)  walsh_wak_loc_dit2(f+(1UL<<ldm), ldm);

        for (ulong ldm=1; ldm<=ldn; ++ldm)
        {
            const ulong m = (1UL<<ldm);
            const ulong mh = (m>>1);
            for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
        }
    }
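That the reorganized recursion computes the same result as the usual radix-2 algorithm can be checked with a small self-contained program. The sketch below uses long integers instead of a template type; the names `walsh_dit2`, `walsh_loc_dit2` and `loc_matches_reference` are hypothetical.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// {a, b} <--| {a+b, a-b}
inline void sumdiff(long &a, long &b)  { const long t = a - b;  a += b;  b = t; }

// Reference: plain radix-2 DIT Walsh (wak) transform of length 2**ldn.
inline void walsh_dit2(long *f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = 1UL << ldm,  mh = m >> 1;
        for (ulong r=0; r<n; r+=m)
            for (ulong j=0; j<mh; ++j)  sumdiff(f[r+j], f[r+j+mh]);
    }
}

// Localized variant: first transform the postponed blocks
// [2,4), [4,8), [8,16), ..., then combine from the start of the array.
inline void walsh_loc_dit2(long *f, ulong ldn)
{
    if ( ldn < 1 )  return;
    for (ulong ldm=1; ldm<ldn; ++ldm)  walsh_loc_dit2(f + (1UL<<ldm), ldm);
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
    }
}

// Compare both routines on a fixed test sequence.
inline bool loc_matches_reference(ulong ldn)
{
    const ulong n = 1UL << ldn;
    std::vector<long> a(n), b(n);
    for (ulong i=0; i<n; ++i)  a[i] = b[i] = (long)((i*i + 3*i + 1) % 17);
    walsh_dit2(&a[0], ldn);
    walsh_loc_dit2(&b[0], ldn);
    return a == b;
}
```

When the combining loop reaches length 2·mh, the lower block of length mh has been completed by the earlier iterations of the loop and the upper block by the recursive call, so one pass of sum-difference operations finishes the transform of length 2·mh.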


23.5.2 Optimizing the routine 


Avoiding recursions for small sizes gives a speedup. We use a radix-4 algorithm as soon as the transform
fits into cache memory and avoid recursion for tiny transforms [FXT: walsh/walshwakloc2.h]:

    template <typename Type>
    void walsh_wak_loc_dit2(Type *f, ulong ldn)
    {
        if ( ldn<=13 )  // parameter: (2**13)*sizeof(Type) <= L1-cache
        {
            walsh_wak_dif4(f,ldn);  // note: DIF version, result is the same
            return;
        }

        // Recursion:
        short_walsh_wak_dit_2(f+2);  // ldm==1
        short_walsh_wak_dit_4(f+4);  // ldm==2
        short_walsh_wak_dit_8(f+8);  // ldm==3
        short_walsh_wak_dit_16(f+16);  // ldm==4
        for (ulong ldm=5; ldm<ldn; ++ldm)  walsh_wak_loc_dit2(f+(1UL<<ldm), ldm);

        for (ulong ldm=1; ldm<=ldn; ++ldm)
        {
            const ulong m = (1UL<<ldm);
            const ulong mh = (m>>1);
            for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
        }
    }

Figure 23.4-A: Relative speed of different implementations of the Walsh (wak) transform. The
transforms were run 'rep' times for each measurement. The quantity 'dt' gives the elapsed time for rep
transforms of the given type. The quantity 'MB/s' gives the memory transfer rate as if a radix-2
algorithm were used; it equals 'Memsize' times 'ldn' divided by the time elapsed for a single transform. The
'rel' gives the performance relative to the radix-2 version, smaller values mean better performance.


A decimation in frequency (DIF) version is obtained by executing the inverse steps in reversed order:

    template <typename Type>
    void walsh_wak_loc_dif2(Type *f, ulong ldn)
    {
        if ( ldn<=13 )  // parameter: (2**13)*sizeof(Type) <= L1-cache
        {
            walsh_wak_dif4(f,ldn);
            return;
        }

        for (ulong ldm=ldn; ldm>=1; --ldm)
        {
            const ulong m = (1UL<<ldm);
            const ulong mh = (m>>1);
            for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }

        // Recursion:
        short_walsh_wak_dif_2(f+2);  // ldm==1
        short_walsh_wak_dif_4(f+4);  // ldm==2
        short_walsh_wak_dif_8(f+8);  // ldm==3
        short_walsh_wak_dif_16(f+16);  // ldm==4
        for (ulong ldm=5; ldm<ldn; ++ldm)  walsh_wak_loc_dif2(f+(1UL<<ldm), ldm);
    }


The double loop in the DIT algorithm is a reversed Haar transform, see chapter 24 on page 497. The double
loop in the DIF algorithm is a transposed reversed Haar transform. The short-length transforms are
given in the files [FXT: walsh/shortwalshwakdif.h] and [FXT: walsh/shortwalshwakdit.h]. For
example, the length-8, decimation in frequency routine is

    template <typename Type>
    inline void
    short_walsh_wak_dif_8(Type *f)
    {
        Type t0, t1, t2, t3, t4, t5, t6, t7;
        t0 = f[0];  t1 = f[1];  t2 = f[2];  t3 = f[3];
        t4 = f[4];  t5 = f[5];  t6 = f[6];  t7 = f[7];
        sumdiff( t0, t4 );  sumdiff( t1, t5 );  sumdiff( t2, t6 );  sumdiff( t3, t7 );
        sumdiff( t0, t2 );  sumdiff( t1, t3 );  sumdiff( t4, t6 );  sumdiff( t5, t7 );
        sumdiff( t0, t1 );  sumdiff( t2, t3 );  sumdiff( t4, t5 );  sumdiff( t6, t7 );
        f[0] = t0;  f[1] = t1;  f[2] = t2;  f[3] = t3;
        f[4] = t4;  f[5] = t5;  f[6] = t6;  f[7] = t7;
    }


The strategy used leads to a very favorable memory access pattern that results in excellent performance
for large transforms. Figure 23.5-A shows a comparison between the localized transforms and the matrix
algorithm. Small sizes are omitted because the localized algorithm has the same speed as the radix-4
algorithm it falls back to. The localized algorithms are the clear winners, even against the matrix
algorithm with only one transposition. For very large transforms the DIF version is slightly faster, as
it starts with smaller chunks of data and therefore more of the data is in the cache when the larger
sub-arrays get accessed.

    14 == ldn;  MemSize == 128 kB == 16384 doubles;  rep == 2180
    walsh_wak_matrix(f,ldn);       dt= 0.672327  MB/s= 5674  rel= 1
    walsh_wak_matrix_1(f,ldn,+1);  dt= 0.555851  MB/s= 6863  rel= 0.826756
    walsh_wak_loc_dit2(f,ldn);     dt= 0.498558  MB/s= 7652  rel= 0.741541 *
    walsh_wak_loc_dif2(f,ldn);     dt= 0.533746  MB/s= 7148  rel= 0.793878
    16 == ldn;  MemSize == 512 kB == 65536 doubles;  rep == 477
    walsh_wak_matrix(f,ldn);       dt= 0.919579  MB/s= 4150  rel= 1
    walsh_wak_matrix_1(f,ldn,+1);  dt= 0.692488  MB/s= 5511  rel= 0.753049
    walsh_wak_loc_dit2(f,ldn);     dt= 0.653256  MB/s= 5842  rel= 0.710386 *
    walsh_wak_loc_dif2(f,ldn);     dt= 0.670104  MB/s= 5695  rel= 0.728707
    18 == ldn;  MemSize == 2 MB == 256 K doubles;  rep == 106
    walsh_wak_matrix(f,ldn);       dt= 2.2111    MB/s= 1726  rel= 1
    walsh_wak_matrix_1(f,ldn,+1);  dt= 1.36827   MB/s= 2789  rel= 0.618819
    walsh_wak_loc_dit2(f,ldn);     dt= 0.938006  MB/s= 4068  rel= 0.424225
    walsh_wak_loc_dif2(f,ldn);     dt= 0.927804  MB/s= 4113  rel= 0.419611 *
    20 == ldn;  MemSize == 8 MB == 1024 K doubles;  rep == 24
    walsh_wak_matrix(f,ldn);       dt= 2.31178   MB/s= 1661  rel= 1
    walsh_wak_matrix_1(f,ldn,+1);  dt= 1.42614   MB/s= 2693  rel= 0.616901
    walsh_wak_loc_dit2(f,ldn);     dt= 1.11847   MB/s= 3433  rel= 0.483811
    walsh_wak_loc_dif2(f,ldn);     dt= 1.11142   MB/s= 3455  rel= 0.480765 *
    22 == ldn;  MemSize == 32 MB == 4096 K doubles;  rep ==
    walsh_wak_matrix(f,ldn);       dt= 2.00573   MB/s= 1755  rel= 1
    walsh_wak_matrix_1(f,ldn,+1);  dt= 1.23695   MB/s= 2846  rel= 0.616707
    walsh_wak_loc_dit2(f,ldn);     dt= 1.16461   MB/s= 3022  rel= 0.580644
    walsh_wak_loc_dif2(f,ldn);     dt= 1.16164   MB/s= 3030  rel= 0.579162 *
    24 == ldn;  MemSize == 128 MB == 16384 K doubles;  rep == 1
    walsh_wak_matrix(f,ldn);       dt= 2.16536   MB/s= 1419  rel= 1
    walsh_wak_matrix_1(f,ldn,+1);  dt= 1.28455   MB/s= 2392  rel= 0.593226
    walsh_wak_loc_dit2(f,ldn);     dt= 1.10769   MB/s= 2773  rel= 0.511552
    walsh_wak_loc_dif2(f,ldn);     dt= 1.10601   MB/s= 2778  rel= 0.510775 *

Figure 23.5-A: Speed comparison between localized and matrix algorithms for the Walsh transform.


The localized algorithm can easily be implemented for transforms where a radix-2 step is known.
Section 25.8 on page 529 gives the fast Hartley transform variant of the localized algorithm.


Similar routines with higher radix can be developed. However, a radix-4 version was found to be slower
than the given routines. A speedup can be achieved by unrolling and prefetching. We use the C-type
double whose size is 8 bytes. Substitute the double loop in the DIF version (that is, the Haar transform)
by

    // machine-specific prefetch instruction:
    #define PREF(p,o)  asm volatile ("prefetchw " #o "(%0) " : : "r" (p) )

    ulong ldm;
    for (ldm=ldn; ldm>=6; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        PREF(f, 0);  PREF(f+mh, 0);
        PREF(f, 64);  PREF(f+mh, 64);
        PREF(f, 128);  PREF(f+mh, 128);
        PREF(f, 192);  PREF(f+mh, 192);
        for (ulong t1=0, t2=mh;  t1<mh;  t1+=8, t2+=8)
        {
            double *p1 = f + t1,  *p2 = f + t2;
            PREF(p1, 256);  PREF(p2, 256);
            double u0 = f[t1+0],  v0 = f[t2+0];
            double u1 = f[t1+1],  v1 = f[t2+1];
            double u2 = f[t1+2],  v2 = f[t2+2];
            double u3 = f[t1+3],  v3 = f[t2+3];
            sumdiff(u0, v0);  f[t1+0] = u0;  f[t2+0] = v0;
            sumdiff(u1, v1);  f[t1+1] = u1;  f[t2+1] = v1;
            sumdiff(u2, v2);  f[t1+2] = u2;  f[t2+2] = v2;
            sumdiff(u3, v3);  f[t1+3] = u3;  f[t2+3] = v3;
            double u4 = f[t1+4],  v4 = f[t2+4];
            double u5 = f[t1+5],  v5 = f[t2+5];
            double u6 = f[t1+6],  v6 = f[t2+6];
            double u7 = f[t1+7],  v7 = f[t2+7];
            sumdiff(u4, v4);  f[t1+4] = u4;  f[t2+4] = v4;
            sumdiff(u5, v5);  f[t1+5] = u5;  f[t2+5] = v5;
            sumdiff(u6, v6);  f[t1+6] = u6;  f[t2+6] = v6;
            sumdiff(u7, v7);  f[t1+7] = u7;  f[t2+7] = v7;
        }
    }
    for (  ; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
    }
The following list gives the speed ratio between the optimized and the unoptimized DIF routine:

    14 == ldn;  MemSize == 128 kB;  ratio = 1.24252
    16 == ldn;  MemSize == 512 kB;  ratio = 1.43568
    18 == ldn;  MemSize == 2 MB;    ratio = 1.23875
    20 == ldn;  MemSize == 8 MB;    ratio = 1.21012
    22 == ldn;  MemSize == 32 MB;   ratio = 1.19939
    24 == ldn;  MemSize == 128 MB;  ratio = 1.18245


For sizes that are out of (level-2) cache most of the speedup is due to the memory prefetch. 


23.5.3 Iterative versions of the algorithms 


[Table with the columns 'start' and 'length' for DIF (left) and 'start' and 'length' for DIT (right).]

Figure 23.5-B: Binary values of the start index and length of the Haar transforms in the iterative 
version of the localized DIF (left) and DIT (right) transform. Dots are used for zeros. 


In the DIF algorithm the Haar transforms are executed at positions f+2, f+4, f+6, ... and the length
of the transform at position f+s is determined by the lowest set bit in s. Additionally, a full-length
Haar transform has to be done at the beginning. As C++ code:

    template <typename Type>
    inline void haar_dif2(Type *f, ulong n)
    {
        for (ulong m=n; m>=2; m>>=1)
        {
            const ulong mh = (m>>1);
            for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
        }
    }

    template <typename Type>
    void loc_dif2(Type *f, ulong n)
    {
        haar_dif2(f, n);
        for (ulong z=2; z<n; z+=2)  haar_dif2(f+z, (z&-z));
    }

Note that the routines now take the length of the transform as second argument, not its base-2 logarithm.
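The iterative DIF scheme can be compared against a plain radix-2 transform as follows. This is a self-contained sketch with long integers instead of a template type; `walsh_radix2` and `loc_dif_matches_reference` are hypothetical helper names, while `haar_dif2` and `loc_dif2` follow the routines of the text.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// {a, b} <--| {a+b, a-b}
inline void sumdiff(long &a, long &b)  { const long t = a - b;  a += b;  b = t; }

// Reference: plain radix-2 Walsh (wak) transform, n a power of two.
inline void walsh_radix2(long *f, ulong n)
{
    for (ulong mh=n>>1; mh>=1; mh>>=1)
        for (ulong r=0; r<n; r+=2*mh)
            for (ulong j=0; j<mh; ++j)  sumdiff(f[r+j], f[r+j+mh]);
}

// Combine halves of f[0..m) for m = n, n/2, ..., 2.
inline void haar_dif2(long *f, ulong n)
{
    for (ulong m=n; m>=2; m>>=1)
    {
        const ulong mh = m >> 1;
        for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
    }
}

// Iterative localized DIF transform: one full-length pass, then a pass
// at every even position z whose length is the lowest set bit of z.
inline void loc_dif2(long *f, ulong n)
{
    haar_dif2(f, n);
    for (ulong z=2; z<n; z+=2)  haar_dif2(f+z, (z & -z));
}

// Compare both routines on a fixed test sequence of length n.
inline bool loc_dif_matches_reference(ulong n)
{
    std::vector<long> a(n), b(n);
    for (ulong i=0; i<n; ++i)  a[i] = b[i] = (long)((5*i*i + i + 2) % 23);
    walsh_radix2(&a[0], n);
    loc_dif2(&b[0], n);
    return a == b;
}
```

The expression `z & -z` isolates the lowest set bit of the unsigned value z, so the pass at position 4 has length 4, the pass at position 6 has length 2, and so on.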


With the DIT algorithm matters are slightly more complicated. A pattern can be observed by printing
the binary expansions of the starting position and length of the transforms shown in figure 23.5-B (created
with [FXT: fft/locrec-demo.cc]). The lengths are again determined by the lowest bit of the start position.
We have already seen the pattern in the left column: the reversed binary words in reversed lexicographic
order, see figure on page 70. The implementation is quite concise:

    template <typename Type>
    inline void haar_dit2(Type *f, ulong n)
    {
        for (ulong m=1; m<=n; m<<=1)
        {
            const ulong mh = (m>>1);
            for (ulong t1=0, t2=mh;  t1<mh;  ++t1, ++t2)  sumdiff(f[t1], f[t2]);
        }
    }

    template <typename Type>
    void loc_dit2(Type *f, ulong n)
    {
        for (ulong z=2, u=1;  z<n;  z+=2)
        {
            ulong s = u<<1;
            haar_dit2(f+s, (s&-s));
            u = prev_lexrev(u);
        }
        haar_dit2(f, n);
    }


The routines are slightly slower than the recursive version because they do not fall back to the full Walsh
transforms if the transform size is small.

The DIT scheme is a somewhat surprising application of the seemingly esoteric routine prev_lexrev()
in [FXT: bits/bitlex.h]. Plus we have found a recursive algorithm for the generation of the binary words
in lexicographic order [FXT: bits/bitlex-rec-demo.cc]:

    void bitlex_b(ulong f, ulong n)
    {
        for (ulong m=1; m<n; m<<=1)  bitlex_b(f+m, m);
        print_bin("  ", f, ldn);
    }


23.6 Transform with Walsh-Paley basis 


A Walsh transform with a different ordering of the basis (see figure 23.6-A) can be computed by [FXT:
walsh/walshpal.h]:

    template <typename Type>
    void walsh_pal(Type *f, ulong ldn)
    {
        const ulong n = 1UL<<ldn;
        revbin_permute(f, n);
        walsh_wak(f, ldn);
        // ==
        // walsh_wak(f, ldn);
        // revbin_permute(f, n);
    }


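Write Z for the zip permutation (see section 2.10 on page 125), and G for the Gray permutation (see
section 2.12 on page 128), then we have, with $W_p$ the transform with Walsh-Paley basis,

\[ W_p = G\, W_p\, G = G^{-1}\, W_p\, G^{-1} \tag{23.6-1} \]
\[ \phantom{W_p} = Z\, W_p\, Z = Z^{-1}\, W_p\, Z^{-1} \tag{23.6-2} \]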


A function to compute the k-th basis function of the transform is [FXT: walsh/walsh-basis.h]:

    template <typename Type>
    void walsh_pal_basis(Type *f, ulong n, ulong k)
    {
        k = revbin(k, ld(n));
        for (ulong i=0; i<n; ++i)
        {
            ulong x = i & k;
            x = parity(x);
            f[i] = ( 0==x ? +1 : -1 );
        }
    }

[Chart of the 32 basis functions of length 32, one row per function. The values given in parentheses
after rows 0 to 31 are: 0, 1, 3, 2, 7, 6, 4, 5, 15, 14, 12, 13, 8, 9, 11, 10, 31, 30, 28, 29, 24, 25, 27, 26,
16, 17, 19, 18, 23, 22, 20, 21.]

Figure 23.6-A: Walsh-Paley basis. Asterisks denote the value +1, blank entries denote −1.
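The basis routine can be exercised without the library. The sketch below supplies simple stand-ins for `revbin()` and `parity()` and counts the sign changes of each basis function; the function `sequency()` and the use of `std::vector<int>` are assumptions made for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Reverse the lowest ldn bits of x (simple stand-in for revbin()).
inline ulong revbin(ulong x, ulong ldn)
{
    ulong r = 0;
    for (ulong b=0; b<ldn; ++b)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Return 1 if the number of set bits of x is odd, else 0.
inline ulong parity(ulong x)
{
    ulong p = 0;
    for ( ; x; x >>= 1)  p ^= (x & 1);
    return p;
}

// k-th Walsh-Paley basis function of length n = 2**ldn.
inline std::vector<int> walsh_pal_basis(ulong ldn, ulong k)
{
    const ulong n = 1UL << ldn;
    std::vector<int> f(n);
    k = revbin(k, ldn);
    for (ulong i=0; i<n; ++i)  f[i] = ( parity(i & k) == 0 ? +1 : -1 );
    return f;
}

// Number of sign changes of a sequence of +1 and -1.
inline ulong sequency(const std::vector<int> &f)
{
    ulong s = 0;
    for (std::size_t i=1; i<f.size(); ++i)  s += ( f[i] != f[i-1] );
    return s;
}
```

For length 8 the basis functions 0, 1, ..., 7 have 0, 1, 3, 2, 7, 6, 4, 5 sign changes, respectively.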


23.7 Sequency-ordered Walsh transforms 


The term corresponding to the frequency of the Fourier basis functions is the sequency of the Walsh 
functions, the number of the changes of sign of the individual functions. Note that the sequency of a 
signal with frequency f usually is 2 f. 


To order the basis functions by their sequency, use 


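    const ulong n = (1UL<<ldn);
    walsh_wak(f, ldn);
    revbin_permute(f, n);
    inverse_gray_permute(f, n);

That is, writing $W_k$ for the transform with Walsh-Kronecker basis and R for the revbin permutation,

\[ W_w = G^{-1}\, R\, W_k = W_k\, R\, G \tag{23.7-1} \]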


A function that computes the k-th basis function of the transform is [FXT: walsh/walsh-basis.h]:

    template <typename Type>
    void walsh_wal_basis(Type *f, ulong n, ulong k)
    {
        k = revbin(k, ld(n)+1);
        k = gray_code(k);
        // ==
        // k = revbin(k, ld(n));
        // k = rev_gray_code(k);
        for (ulong i=0; i<n; ++i)
        {
            ulong x = i & k;
            x = parity(x);
            f[i] = ( 0==x ? +1 : -1 );
        }
    }

[Chart of the 32 basis functions of length 32, one row per function. The value given in parentheses
after each of the rows 0 to 31 equals the row index.]

Figure 23.7-A: The Walsh-Kacmarz basis is sequency-ordered. Asterisks denote +1, and blanks −1.


A version of the transform that avoids the Gray permutation is based on [FXT: walsh/walshwal.h]:

    template <typename Type>
    void walsh_wal_dif2_core(Type *f, ulong ldn)
    // Core routine for sequency-ordered Walsh transform.
    // Radix-2 decimation in frequency (DIF) algorithm.
    {
        const ulong n = (1UL<<ldn);

        for (ulong ldm=ldn; ldm>=2; --ldm)
        {
            const ulong m = (1UL<<ldm);
            const ulong mh = (m>>1);
            const ulong m4 = (mh>>1);
            for (ulong r=0; r<n; r+=m)
            {
                ulong j;
                for (j=0; j<m4; ++j)
                {
                    ulong t1 = r+j;
                    ulong t2 = t1+mh;
                    double u = f[t1];
                    double v = f[t2];
                    f[t1] = u + v;
                    f[t2] = u - v;
                }

                for (  ; j<mh; ++j)
                {
                    ulong t1 = r+j;
                    ulong t2 = t1+mh;
                    double u = f[t1];
                    double v = f[t2];
                    f[t1] = u + v;
                    f[t2] = v - u;  // reversed
                }
            }
        }

        if ( ldn )
        {
            // ulong ldm=1;
            const ulong m = 2;  //(1UL<<ldm);
            const ulong mh = 1;  //(m>>1);
            for (ulong r=0; r<n; r+=m)
            {
                ulong j = 0;
                // for (ulong j=0; j<mh; ++j)
                {
                    ulong t1 = r+j;
                    ulong t2 = t1+mh;
                    double u = f[t1];
                    double v = f[t2];
                    f[t1] = u + v;
                    f[t2] = u - v;
                }
            }
        }
    }


The transform still needs the revbin permutation:

    template <typename Type>
    inline void walsh_wal(Type *f, ulong ldn)
    {
        revbin_permute(f, (1UL<<ldn));
        walsh_wal_dif2_core(f, ldn);
        // ==
        // walsh_wal_dit2_core(f, ldn);
        // revbin_permute(f, (1UL<<ldn));
    }

A decimation in time version of the core routine is also given in [FXT: walsh/walshwal.h]. The procedure
gray_permute() is given in section 2.12 on page 128.
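The sequency ordering can be observed directly by transforming delta peaks: the transform of a peak at position k is the k-th basis function, which should have k sign changes. The sketch below is self-contained and uses long integers; `revbin`, `revbin_permute` and `sequency_of_row` are simple hypothetical stand-ins, and the core routine is a compact restatement of the DIF core routine of the text, not the FXT source.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef unsigned long ulong;

// Reverse the lowest ldn bits of x.
inline ulong revbin(ulong x, ulong ldn)
{
    ulong r = 0;
    for (ulong b=0; b<ldn; ++b)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Swap f[i] with f[revbin(i)] for all i (length 2**ldn).
inline void revbin_permute(long *f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong i=0; i<n; ++i)
    {
        const ulong r = revbin(i, ldn);
        if ( r > i )  std::swap(f[i], f[r]);
    }
}

// Radix-2 DIF core: in the second half of each block of sums the
// difference is taken with reversed sign; the last step is unmodified.
inline void walsh_wal_dif2_core(long *f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong ldm=ldn; ldm>=2; --ldm)
    {
        const ulong m = 1UL << ldm,  mh = m >> 1,  m4 = mh >> 1;
        for (ulong r=0; r<n; r+=m)
            for (ulong j=0; j<mh; ++j)
            {
                const ulong t1 = r+j,  t2 = t1+mh;
                const long u = f[t1],  v = f[t2];
                f[t1] = u + v;
                f[t2] = ( j<m4 ? u - v : v - u );
            }
    }
    if ( ldn )
        for (ulong r=0; r<n; r+=2)
        {
            const long u = f[r],  v = f[r+1];
            f[r] = u + v;
            f[r+1] = u - v;
        }
}

// Sign changes in the transform of a delta peak at position k.
inline ulong sequency_of_row(ulong ldn, ulong k)
{
    const ulong n = 1UL << ldn;
    std::vector<long> f(n, 0);
    f[k] = 1;
    revbin_permute(&f[0], ldn);
    walsh_wal_dif2_core(&f[0], ldn);
    ulong s = 0;
    for (ulong i=1; i<n; ++i)  s += ( f[i] != f[i-1] );
    return s;
}
```

For example, with ldn = 3 the peak at position 5 is transformed to the sequence +1, −1, −1, +1, −1, +1, +1, −1, which has 5 sign changes.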


23.7.1 Even/odd ordering of sequencies 


A transform with an alternative ordering of the basis functions (first even sequencies ascending, then odd
sequencies descending) can be computed as follows [FXT: walsh/walshwalrev.h]:

    template <typename Type>
    inline void walsh_wal_rev(Type *f, ulong ldn)
    {
        revbin_permute(f, (1UL<<ldn));
        walsh_wal_dit2_core(f, ldn);
        // ==
        // walsh_wal_dif2_core(f, ldn);
        // revbin_permute(f, (1UL<<ldn));
    }

This implementation uses the equality (writing $W_{wr}$ for the transform computed by walsh_wal_rev())

\[ W_{wr} = R\, W_w\, R \tag{23.7-2} \]
[Chart of the 32 basis functions of length 32, one row per function. The values given in parentheses
after rows 0 to 15 are 0, 2, 4, ..., 30, and after rows 16 to 31 they are 31, 29, 27, ..., 1.]

Figure 23.7-B: Basis functions for the reversed sequency-ordered Walsh transform. Asterisks denote
the value +1, blank entries denote −1.


The same transform can be computed by either of the following sequences of statements (with
n=1UL<<ldn):

    { revbin_permute(f, n);  gray_permute(f, n);  walsh_wak(f, ldn); }
    { walsh_wak(f, ldn);  inverse_gray_permute(f, n);  revbin_permute(f, n); }

    { zip_rev(f, n);  walsh_wal(f, ldn); }
    { walsh_wal(f, ldn);  unzip_rev(f, n); }


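The corresponding identities are, writing $W_{wr}$ for the transform with reversed sequency ordering,
$W_k$ for the transform with Walsh-Kronecker basis, and $\overline{Z}$ for the permutation done by zip_rev(),

\[ W_{wr} = W_k\, G\, R = R\, G^{-1}\, W_k \tag{23.7-3a} \]
\[ W_{wr} = W_w\, \overline{Z} = \overline{Z}^{-1}\, W_w \tag{23.7-3b} \]

Similar relations as for the transform with Walsh-Paley basis (23.6-1 and 23.6-2 on page 473) hold for $W_w$:

\[ W_w = G\, W_w\, G = G^{-1}\, W_w\, G^{-1} \tag{23.7-4a} \]
\[ \phantom{W_w} = \overline{Z}\, W_w\, \overline{Z} = \overline{Z}^{-1}\, W_w\, \overline{Z}^{-1} \tag{23.7-4b} \]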


The k-th basis function of the transform can be computed as [FXT: walsh/walsh-basis.h]:

template <typename Type>
void walsh_wal_rev_basis(Type *f, ulong n, ulong k)
{
    k = revbin(k, ld(n));
    k = gray_code(k);
    // ==
    // k = rev_gray_code(k);
    // k = revbin(k, ld(n));
    for (ulong i=0; i<n; ++i)
    {
        ulong x = i & k;
        x = parity(x);
        f[i] = ( 0==x ? +1 : -1 );
    }
}
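The ordering of the sequencies can be checked with a small stand-alone sketch. It does not use the FXT library: the helpers `revbin_bits()`, `gray()`, `parity_of()`, `rev_seq_basis()` and `sequency()` are hypothetical stand-ins written for this illustration only. The basis function is built with the same steps as in the routine above and its sign changes are counted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long ulong;

// Reverse the lowest ldn bits of x (stand-in for revbin(), an assumption of this sketch).
static ulong revbin_bits(ulong x, ulong ldn)
{
    ulong r = 0;
    for (ulong b = 0; b < ldn; ++b) { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Binary Gray code of x.
static ulong gray(ulong x) { return x ^ (x >> 1); }

// Parity of the number of set bits of x (0 or 1).
static ulong parity_of(ulong x)
{
    ulong p = 0;
    while ( x ) { p ^= (x & 1);  x >>= 1; }
    return p;
}

// k-th basis function of length 2**ldn, same steps as walsh_wal_rev_basis().
static std::vector<int> rev_seq_basis(ulong ldn, ulong k)
{
    const ulong n = 1UL << ldn;
    k = gray( revbin_bits(k, ldn) );
    std::vector<int> f(n);
    for (ulong i = 0; i < n; ++i)  f[i] = ( parity_of(i & k) == 0 ? +1 : -1 );
    return f;
}

// Sequency: number of sign changes in f.
static ulong sequency(const std::vector<int> &f)
{
    ulong s = 0;
    for (std::size_t i = 1; i < f.size(); ++i)  s += ( f[i] != f[i-1] );
    return s;
}
```

For a length-32 transform the sequencies come out as 0, 2, 4, ..., 30 for the first half of the indices and 31, 29, ..., 1 for the second half, which agrees with the numbers in parentheses in figure 23.7-B.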


23.7.2 Transforms with sequencies n/2 or n/2—1 


Figure 23.7-C: Basis functions for a self-inverse Walsh transform that has sequencies n/2 and n/2-1
only. Asterisks denote the value +1, blank entries denote -1.


The next variant of the Walsh transform has the interesting feature that the basis functions for a length-n
transform have only sequencies n/2 and n/2-1 at the even and odd indices, respectively. The basis is
shown in figure 23.7-C. The transform is self-inverse and can be computed via [FXT: walsh/walshq.h]:

template <typename Type>
void walsh_q1(Type *f, ulong ldn)
{
    ulong n = 1UL << ldn;
    grs_negate(f, n);
    walsh_gray(f, ldn);
    revbin_permute(f, n);
}

The routine walsh_gray() is given in [FXT: walsh/walshgray.h]:

template <typename Type>
void walsh_gray(Type *f, ulong ldn)
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>0; --ldm)  // dif
    {
        const ulong m = (1UL<<ldm);
        for (ulong r=0; r<n; r+=m)
        {
            ulong t1 = r;
            ulong t2 = r + m - 1;
            for ( ; t1<t2; ++t1,--t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}

Figure 23.7-D: Data flow for the length-16 Walsh-Gray routine (stages ldm=4, ldm=3, ldm=2, ldm=1).


The data flow is shown in figure 23.7-D; note how the halves of the sub-arrays are accessed in mutually
reversed order.


A basis with sequencies n/2 for the first half of the functions and sequencies n/2-1 for the second half
is shown in figure 23.7-E. The corresponding transform can be computed by [FXT: walsh/walshq.h]:

template <typename Type>
void walsh_q2(Type *f, ulong ldn)
{
    ulong n = 1UL << ldn;
    revbin_permute(f, n);
    grs_negate(f, n);
    walsh_gray(f, ldn);
    // ==
    // grs_negate(f, n);
    // revbin_permute(f, n);
    // walsh_gray(f, ldn);
}

The transform could be computed by the following statements:

    ulong n = 1UL << ldn;
    revbin_permute(f, n);
    walsh_q1(f, ldn);
    revbin_permute(f, n);

Figure 23.7-E: Basis functions for a self-inverse Walsh transform (second form) that has sequencies n/2
and n/2-1 only. Asterisks denote the value +1, blank entries denote -1.

The basis functions of the transforms can be computed as follows [FXT: walsh/walsh-basis.h]:

template <typename Type>
void walsh_q1_basis(Type *f, ulong n, ulong k)
{
    ulong qk = (grs_negative_q(k) ? 1 : 0);
    k = gray_code(k);
    k = revbin(k, ld(n));
    for (ulong i=0; i<n; ++i)
    {
        ulong x = i & k;
        x = parity(x);
        ulong qi = (grs_negative_q(i) ? 1 : 0);
        x ^= (qk ^ qi);
        f[i] = ( 0==x ? +1 : -1 );
    }
}

and

template <typename Type>
void walsh_q2_basis(Type *f, ulong n, ulong k)
{
    ulong qk = (grs_negative_q(k) ? 1 : 0);
    k = revbin(k, ld(n));
    k = gray_code(k);
    for (ulong i=0; i<n; ++i)
    {
        ulong x = i & k;
        x = parity(x);
        ulong qi = (grs_negative_q(i) ? 1 : 0);
        x ^= (qk ^ qi);
        f[i] = ( 0==x ? +1 : -1 );
    }
}

The function grs_negative_q() is described in section 1.16.5 on page 44.

23.8 XOR (dyadic) convolution 


XOR-convolution
 +--   0  1  2  3   4  5  6  7   8  9 10 11  12 13 14 15
  0:   0  1  2  3   4  5  6  7   8  9 10 11  12 13 14 15
  1:   1  0  3  2   5  4  7  6   9  8 11 10  13 12 15 14
  2:   2  3  0  1   6  7  4  5  10 11  8  9  14 15 12 13
  3:   3  2  1  0   7  6  5  4  11 10  9  8  15 14 13 12
  4:   4  5  6  7   0  1  2  3  12 13 14 15   8  9 10 11
  5:   5  4  7  6   1  0  3  2  13 12 15 14   9  8 11 10
  6:   6  7  4  5   2  3  0  1  14 15 12 13  10 11  8  9
  7:   7  6  5  4   3  2  1  0  15 14 13 12  11 10  9  8
  8:   8  9 10 11  12 13 14 15   0  1  2  3   4  5  6  7
  9:   9  8 11 10  13 12 15 14   1  0  3  2   5  4  7  6
 10:  10 11  8  9  14 15 12 13   2  3  0  1   6  7  4  5
 11:  11 10  9  8  15 14 13 12   3  2  1  0   7  6  5  4
 12:  12 13 14 15   8  9 10 11   4  5  6  7   0  1  2  3
 13:  13 12 15 14   9  8 11 10   5  4  7  6   1  0  3  2
 14:  14 15 12 13  10 11  8  9   6  7  4  5   2  3  0  1
 15:  15 14 13 12  11 10  9  8   7  6  5  4   3  2  1  0

Figure 23.8-A: Semi-symbolic scheme for the XOR-convolution of two length-16 sequences.


The dyadic convolution of the sequences a and b is the sequence h defined by

$$ h_\tau = \sum_{i \oplus j = \tau} a_i \, b_j \qquad \text{(23.8-1a)} $$
$$ \phantom{h_\tau} = \sum_{i} a_i \, b_{i \oplus \tau} \qquad \text{(23.8-1b)} $$

where the symbol '$\oplus$' stands for the bit-wise XOR operator. The dyadic convolution has an XOR where
the usual one has a plus, see relations 22.1-1b and 22.1-2 on page 440; it could rightfully be called
XOR-convolution.

The semi-symbolic scheme of the convolution is shown in figure 23.8-A. The table is equivalent to the
one (for cyclic convolution) given in figure 22.1-A. The dyadic convolution can be used for
the multiplication of hypercomplex numbers as shown in section 39.14 on page 815.


A fast algorithm for the computation of the dyadic convolution uses the Walsh transform [FXT:
walsh/dyadiccnvl.h]:

template <typename Type>
void dyadic_convolution(Type * restrict f, Type * restrict g, ulong ldn)
// Dyadic convolution (XOR-convolution): h[] of f[] and g[]:
//   h[k] = sum( i XOR j == k, f[i]*g[j] )
// Result is written to g[].
// ldn := base-2 logarithm of the array length
{
    walsh_wak(f, ldn);
    walsh_wak(g, ldn);
    const ulong n = (1UL<<ldn);
    for (ulong k=0; k<n; ++k)  g[k] *= f[k];
    walsh_wak(g, ldn);
}
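The three-transform scheme can be compared against the definition with a small stand-alone sketch. It does not use the FXT routines: `walsh_sketch()`, `xor_convolution_slow()` and `xor_convolution_fast()` are hypothetical names introduced here, and the final division by n is an assumption of this sketch about normalization (the Walsh transform used is unnormalized, so applying it twice multiplies by n).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalized Walsh transform (Kronecker ordering), in place.
// The length of f must be a power of 2.
static void walsh_sketch(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
            {
                const long u = f[j], v = f[j+mh];
                f[j] = u + v;
                f[j+mh] = u - v;
            }
}

// Reference by definition: h[i XOR j] += a[i] * b[j], quadratic cost.
static std::vector<long> xor_convolution_slow(const std::vector<long> &a,
                                              const std::vector<long> &b)
{
    const std::size_t n = a.size();
    std::vector<long> h(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            h[i ^ j] += a[i] * b[j];
    return h;
}

// Fast version: transform both, multiply element-wise, transform back, divide by n.
static std::vector<long> xor_convolution_fast(std::vector<long> a, std::vector<long> b)
{
    const long n = (long)a.size();
    walsh_sketch(a);
    walsh_sketch(b);
    for (std::size_t k = 0; k < b.size(); ++k)  b[k] *= a[k];
    walsh_sketch(b);
    for (std::size_t k = 0; k < b.size(); ++k)  b[k] /= n;  // exact, as W*W = n*id
    return b;
}
```

With integer input the division is exact, so the two functions can be compared with `==`.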
 +--    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  1:    1   0   3   2   5   4   7   6   9   8  11  10  13  12  15  14
  2:    2   3   0   1   6   7   4   5  10  11   8   9  14  15  12  13
  3:    3   2   1   0   7   6   5   4  11  10   9   8  15  14  13  12
  4:    4   5   6   7   0   1   2   3  12  13  14  15   8   9  10  11
  5:    5   4   7   6   1   0   3   2  13  12  15  14   9   8  11  10
  6:    6   7   4   5   2   3   0   1  14  15  12  13  10  11   8   9
  7:    7   6   5   4   3   2   1   0  15  14  13  12  11  10   9   8
  8:    8   9  10  11  12  13  14  15   0-  1-  2-  3-  4-  5-  6-  7-
  9:    9   8  11  10  13  12  15  14   1-  0-  3-  2-  5-  4-  7-  6-
 10:   10  11   8   9  14  15  12  13   2-  3-  0-  1-  6-  7-  4-  5-
 11:   11  10   9   8  15  14  13  12   3-  2-  1-  0-  7-  6-  5-  4-
 12:   12  13  14  15   8   9  10  11   4-  5-  6-  7-  0-  1-  2-  3-
 13:   13  12  15  14   9   8  11  10   5-  4-  7-  6-  1-  0-  3-  2-
 14:   14  15  12  13  10  11   8   9   6-  7-  4-  5-  2-  3-  0-  1-
 15:   15  14  13  12  11  10   9   8   7-  6-  5-  4-  3-  2-  1-  0-

Figure 23.8-B: Semi-symbolic scheme for the dyadic equivalent of the negacyclic convolution. Negative
contributions to a bucket have a minus appended.


A scheme similar to that of the negacyclic convolution is shown in figure 23.8-B. It can be computed via

    walsh_wal_dif2_core(f, ldn);  // note walsh_wal variant used
    walsh_wal_dif2_core(g, ldn);
    ulong n = (1UL<<ldn);
    for (ulong i=0,j=n-1; i<j; --j,++i)  fht_mul(f[i], f[j], g[i], g[j], 0.5);
    walsh_wal_dit2_core(g, ldn);

where fht_mul() is the operation used for the convolution with fast Hartley transforms [FXT:
...tion/fhtmulsqr.h]:

template <typename Type>
static inline void
fht_mul(Type xi, Type xj, Type &yi, Type &yj, double v)
// yi <-- v*( (yi + yj)*xi + (yi - yj)*xj ) == v*( (xi + xj)*yi + (xi - xj)*yj )
// yj <-- v*( (-yi + yj)*xi + (yi + yj)*xj ) == v*( (-xi + xj)*yi + (xi + xj)*yj )
{
    Type h1p = xi, h1m = xj;
    Type s1 = h1p + h1m, d1 = h1p - h1m;
    Type h2p = yi, h2m = yj;
    yi = (h2p * s1 + h2m * d1) * v;
    yj = (h2m * s1 - h2p * d1) * v;
}


23.9 Slant transform 


The slant transform can be implemented using a Walsh transform and just a little pre/post-processing
[FXT: walsh/slant.cc]:

void slant(double *f, ulong ldn)
{
    walsh_wak(f, ldn);

    ulong n = 1UL<<ldn;
    for (ulong ldm=0; ldm<ldn-1; ++ldm)
    {
        ulong m = 1UL<<ldm;  // m = 1, 2, 4, 8, ..., n/4
        double N = m*2, N2 = N*N;
        double a = sqrt(3.0*N2/(4.0*N2-1.0));
        double b = sqrt(1.0-a*a);  // == sqrt((N2-1)/(4*N2-1));
        for (ulong j=m; j<n-1; j+=4*m)
        {
            ulong t1 = j;
            ulong t2 = j + m;
            double f1 = f[t1], f2 = f[t2];
            f[t1] = a * f1 - b * f2;
            f[t2] = b * f1 + a * f2;
        }
    }
}

Apart from the Walsh transform only an amount of work linear with the array size has to be done: the
inner loop accesses the elements in strides of 4, 8, 16, ..., $2^{n-1}$.


The inverse transform is:

void inverse_slant(double *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    ulong ldm=ldn-2;
    do
    {
        ulong m = 1UL<<ldm;  // m = n/4, n/8, ..., 4, 2, 1
        double N = m*2, N2 = N*N;
        double a = sqrt(3.0*N2/(4.0*N2-1.0));
        double b = sqrt(1.0-a*a);  // == sqrt((N2-1)/(4*N2-1));
        for (ulong j=m; j<n-1; j+=4*m)
        {
            ulong t1 = j;
            ulong t2 = j + m;
            double f1 = f[t1], f2 = f[t2];
            f[t1] = b * f2 + a * f1;
            f[t2] = a * f2 - b * f1;
        }
    }
    while ( ldm-- );

    walsh_wak(f, ldn);
}


A sequency-ordered version of the transform can be implemented as follows:

void slant_seq(double *f, ulong ldn)
{
    slant(f, ldn);
    ulong n = 1UL<<ldn;
    inverse_gray_permute(f, n);
    unzip_rev(f, n);
    revbin_permute(f, n);
}

This implementation can be optimized by combining the involved permutations, see [345].

The inverse is computed by calling the inverse operations in reversed order:

void inverse_slant_seq(double *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    revbin_permute(f, n);
    zip_rev(f, n);
    gray_permute(f, n);
    inverse_slant(f, ldn);
}
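The pairing of slant() and inverse_slant() can be tried out with the following stand-alone sketch. It is not the FXT code: `walsh_sketch()`, `slant_sketch()` and `inverse_slant_sketch()` are hypothetical names, and the Walsh transform used here is assumed to be unnormalized, so the round trip returns n times the input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unnormalized Walsh transform (Kronecker ordering), in place; length must be a power of 2.
static void walsh_sketch(std::vector<double> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
            {
                const double u = f[j], v = f[j+mh];
                f[j] = u + v;
                f[j+mh] = u - v;
            }
}

// Walsh transform followed by the plane rotations, for m = 1, 2, ..., n/4.
static void slant_sketch(std::vector<double> &f)
{
    walsh_sketch(f);
    const std::size_t n = f.size();
    for (std::size_t m = 1; 4*m <= n; m *= 2)
    {
        const double N = 2.0 * m, N2 = N * N;
        const double a = std::sqrt(3.0*N2/(4.0*N2-1.0));
        const double b = std::sqrt(1.0 - a*a);
        for (std::size_t j = m; j + m < n; j += 4*m)
        {
            const double f1 = f[j], f2 = f[j+m];
            f[j]   = a * f1 - b * f2;
            f[j+m] = b * f1 + a * f2;
        }
    }
}

// Inverse rotations in reversed order (m = n/4, ..., 2, 1), then the Walsh transform.
static void inverse_slant_sketch(std::vector<double> &f)
{
    const std::size_t n = f.size();
    for (std::size_t m = n/4; m >= 1; m /= 2)
    {
        const double N = 2.0 * m, N2 = N * N;
        const double a = std::sqrt(3.0*N2/(4.0*N2-1.0));
        const double b = std::sqrt(1.0 - a*a);
        for (std::size_t j = m; j + m < n; j += 4*m)
        {
            const double f1 = f[j], f2 = f[j+m];
            f[j]   = b * f2 + a * f1;
            f[j+m] = a * f2 - b * f1;
        }
    }
    walsh_sketch(f);
}
```

The rotations are orthogonal, so only the two Walsh transforms contribute a scale factor: the composition of both routines is n times the identity.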


23.10 Arithmetic transform 


There are two (mutually inverse) forms of the arithmetic transform, denoted by $Y^+$ and $Y^-$. Their basis
functions are shown in figure 23.10-A.

A routine for the transforms can be obtained by simple modifications in a Walsh transform:

    Walsh:  f[t1] = u + v;  f[t2] = u - v;
    Y(+):   f[t1] = u;      f[t2] = u + v;
    Y(-):   f[t1] = u;      f[t2] = v - u;

A routine for $Y^+$ is [FXT: walsh/arithtransform.h]:

template <typename Type>
void arith_transform_plus(Type *f, ulong ldn)
// Arithmetic Transform (positive sign).
// Radix-2 decimation In Frequency (DIF) algorithm.
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong r=0; r<n; r+=m)
        {
            ulong t1 = r;
            ulong t2 = r+mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u;
                f[t2] = u + v;
            }
        }
    }
}

  0: [++++++++++++++++]     0: [+--+-++--++-+--+]
  1: [ + + + + + + + +]     1: [ + - - + - + + -]
  2: [  ++  ++  ++  ++]     2: [  +-  -+  -+  +-]
  3: [   +   +   +   +]     3: [   +   -   -   +]
  4: [    ++++    ++++]     4: [    +--+    -++-]
  5: [     + +     + +]     5: [     + -     - +]
  6: [      ++      ++]     6: [      +-      -+]
  7: [       +       +]     7: [       +       -]
  8: [        ++++++++]     8: [        +--+-++-]
  9: [         + + + +]     9: [         + - - +]
 10: [          ++  ++]    10: [          +-  -+]
 11: [           +   +]    11: [           +   -]
 12: [            ++++]    12: [            +--+]
 13: [             + +]    13: [             + -]
 14: [              ++]    14: [              +-]
 15: [               +]    15: [               +]

Figure 23.10-A: Basis functions for the transform $Y^+$ (left) and $Y^-$ (right). The values are $\pm 1$, or 0
(blank entries).

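The transform $Y^-$ can be computed similarly:

template <typename Type>
void arith_transform_minus(Type *f, ulong ldn)
// Arithmetic Transform (negative sign).
// Radix-2 decimation In Frequency (DIF) algorithm.
// Inverse of arith_transform_plus().
{
  [--snip--]
                f[t1] = u;
                f[t2] = v - u;
  [--snip--]
}

The length-2 transforms can be written as

$$ Y^+_2 \, v = \begin{bmatrix} +1 & 0 \\ +1 & +1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a \\ a+b \end{bmatrix} \qquad \text{(23.10-1a)} $$
$$ Y^-_2 \, v = \begin{bmatrix} +1 & 0 \\ -1 & +1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a \\ b-a \end{bmatrix} \qquad \text{(23.10-1b)} $$

In Kronecker product notation (see section 23.3 on page 462) the transforms can be written as

$$ Y^+_n = \bigotimes_{k=1}^{\log_2(n)} Y^+_2 \quad \text{where} \quad Y^+_2 = \begin{bmatrix} +1 & 0 \\ +1 & +1 \end{bmatrix} \qquad \text{(23.10-2a)} $$
$$ Y^-_n = \bigotimes_{k=1}^{\log_2(n)} Y^-_2 \quad \text{where} \quad Y^-_2 = \begin{bmatrix} +1 & 0 \\ -1 & +1 \end{bmatrix} \qquad \text{(23.10-2b)} $$

The k-th element of the arithmetic transform $Y^+$ is

$$ Y^+[a]_k = \sum_{i \subseteq k} a_i \qquad \text{(23.10-3a)} $$

where $i \subseteq k$ means that the bits of i are a subset of the bits of k: $i \subseteq k \iff (i \wedge k) = i$. For the
transform $Y^-$ we have

$$ Y^-[a]_k = (-1)^{p(k)} \sum_{i \subseteq k} (-1)^{p(i)} \, a_i = \sum_{i \subseteq k} (-1)^{p(k)+p(i)} \, a_i \qquad \text{(23.10-3b)} $$

where p(x) is the parity of x.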
23.10.1 Reversed arithmetic transform 
  0: [+               ]     0: [+               ]
  1: [++              ]     1: [-+              ]
  2: [+ +             ]     2: [- +             ]
  3: [++++            ]     3: [+--+            ]
  4: [+   +           ]     4: [-   +           ]
  5: [++  ++          ]     5: [+-  -+          ]
  6: [+ + + +         ]     6: [+ - - +         ]
  7: [++++++++        ]     7: [-++-+--+        ]
  8: [+       +       ]     8: [-       +       ]
  9: [++      ++      ]     9: [+-      -+      ]
 10: [+ +     + +     ]    10: [+ -     - +     ]
 11: [++++    ++++    ]    11: [-++-    +--+    ]
 12: [+   +   +   +   ]    12: [+   -   -   +   ]
 13: [++  ++  ++  ++  ]    13: [-+  +-  +-  -+  ]
 14: [+ + + + + + + + ]    14: [- + + - + - - + ]
 15: [++++++++++++++++]    15: [+--+-++--++-+--+]

Figure 23.10-B: Basis functions for the transform $B^+$ (left) and $B^-$ (right).

We define the (mutually inverse) reversed arithmetic transforms $B^+$ and $B^-$ via

$$ B^+_n = \bigotimes_{k=1}^{\log_2(n)} B^+_2 \quad \text{where} \quad B^+_2 = \begin{bmatrix} +1 & +1 \\ 0 & +1 \end{bmatrix} \qquad \text{(23.10-4a)} $$
$$ B^-_n = \bigotimes_{k=1}^{\log_2(n)} B^-_2 \quad \text{where} \quad B^-_2 = \begin{bmatrix} +1 & -1 \\ 0 & +1 \end{bmatrix} \qquad \text{(23.10-4b)} $$

The k-th element of the transform $B^+$ is

$$ B^+[a]_k = \sum_{\bar{i} \subseteq \bar{k}} a_i = \sum_{k \subseteq i} a_i \qquad \text{(23.10-5)} $$

where $\bar{k} = n - 1 - k$ is the complement of k: we have $\bar{e} \subseteq \bar{f} \iff f \subseteq e$.

A routine for the transform $B^+$ is [FXT: walsh/arithtransform.h]:

template <typename Type>
void rev_arith_transform_plus(Type *f, ulong ldn)
{
  [--snip--]
                f[t1] = u + v;
                f[t2] = v;
  [--snip--]
}


The omitted lines are identical to the routine for $Y^+$. The same transform could be computed by the
statements:

    ulong n=1UL<<ldn;
    reverse(f,n);  arith_transform_plus(f,ldn);  reverse(f,n);

The inverse $B^-$ is computed as follows:

template <typename Type>
void rev_arith_transform_minus(Type *f, ulong ldn)
// Inverse of rev_arith_transform_plus().
{
  [--snip--]
                f[t1] = u - v;
                f[t2] = v;
  [--snip--]
}
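The subset-sum description of the transforms and the relation between the plain and the reversed form can be checked with a short stand-alone sketch. The function names `arith_plus()`, `arith_minus()` and `rev_arith_plus()` are hypothetical stand-ins for the FXT routines; the loops are written with the half-length `mh` increasing, which is assumed to be equivalent here because the stages commute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Y(+): f[t1] = u;  f[t2] = u + v;   (length must be a power of 2)
static void arith_plus(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
                f[j+mh] += f[j];
}

// Y(-): f[t1] = u;  f[t2] = v - u;   (inverse of arith_plus)
static void arith_minus(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
                f[j+mh] -= f[j];
}

// B(+): f[t1] = u + v;  f[t2] = v;   (reversed arithmetic transform)
static void rev_arith_plus(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
                f[j] += f[j+mh];
}
```

After `arith_plus()` the element at index k holds the sum of the input elements whose index is a bit-subset of k; `arith_minus()` undoes this, and `rev_arith_plus()` agrees with reversing, applying `arith_plus()`, and reversing again.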


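23.10.2 Conversion to and from the Walsh transform ‡

To establish the relation to the Walsh transform recall that its decomposition as a Kronecker product is

$$ W_n = \bigotimes_{k=1}^{\log_2(n)} W_2 \quad \text{where} \quad W_2 = \begin{bmatrix} +1 & +1 \\ +1 & -1 \end{bmatrix} \qquad \text{(23.10-6)} $$

We have $(W \, Y^+) \, Y^- = W$, and the expression in parentheses is the matrix that converts the arithmetic
transform $Y^-$ to the Walsh transform. Similarly, $(\frac{1}{2} Y^+ W) \, W = Y^+$ gives the matrix for the conversion
from the Walsh transform to the arithmetic transform $Y^+$. We only need length-2 transforms to obtain
the conversions:

$$ (W_2 \, Y^+_2) \, Y^-_2 = W_2 = \begin{bmatrix} +2 & +1 \\ 0 & -1 \end{bmatrix} Y^-_2 \qquad \text{(23.10-7a)} $$
$$ (W_2 \, Y^-_2) \, Y^+_2 = W_2 = \begin{bmatrix} 0 & +1 \\ +2 & -1 \end{bmatrix} Y^+_2 \qquad \text{(23.10-7b)} $$
$$ (\tfrac{1}{2} Y^-_2 \, W_2) \, W_2 = Y^-_2 = \frac{1}{2} \begin{bmatrix} +1 & +1 \\ 0 & -2 \end{bmatrix} W_2 \qquad \text{(23.10-7c)} $$
$$ (\tfrac{1}{2} Y^+_2 \, W_2) \, W_2 = Y^+_2 = \frac{1}{2} \begin{bmatrix} +1 & +1 \\ +2 & 0 \end{bmatrix} W_2 \qquad \text{(23.10-7d)} $$

The Kronecker product of the given matrices gives the converting transform. For example, using
relation 23.10-7a, define

$$ T_n = \bigotimes_{k=1}^{\log_2(n)} \begin{bmatrix} +2 & +1 \\ 0 & -1 \end{bmatrix} \qquad \text{(23.10-8)} $$

Then $T_n$ converts an arithmetic transform $Y^-$ to a Walsh transform: $W_n = T_n \, Y^-_n$. The relations between
the arithmetic transform, the Reed-Muller transform, and the Walsh transform are treated in [...].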
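The conversion by $T_n$ can be checked numerically. The following stand-alone sketch applies $Y^-$ and then the butterfly for the matrix of relation 23.10-7a, and compares with a direct Walsh transform. All function names (`walsh_sketch()`, `arith_minus()`, `convert_minus_to_walsh()`) are hypothetical and not taken from FXT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalized Walsh transform: f[t1] = u + v;  f[t2] = u - v;
static void walsh_sketch(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
            {
                const long u = f[j], v = f[j+mh];
                f[j] = u + v;
                f[j+mh] = u - v;
            }
}

// Arithmetic transform Y(-): f[t1] = u;  f[t2] = v - u;
static void arith_minus(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
                f[j+mh] -= f[j];
}

// Kronecker power of [[+2, +1], [0, -1]]: f[t1] = 2*u + v;  f[t2] = -v;
static void convert_minus_to_walsh(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
            {
                const long u = f[j], v = f[j+mh];
                f[j] = 2*u + v;
                f[j+mh] = -v;
            }
}
```

For a single length-2 stage the check is immediate: $Y^-$ maps (u, v) to (u, v-u), and the conversion step maps that to (2u + v - u, u - v) = (u+v, u-v).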


23.11 Reed-Muller transform 


The Reed-Muller transform is obtained from the arithmetic transform by working modulo 2: replace all
+ and - by XOR. The transform is self-inverse, its basis functions are identical to those of the arithmetic
transform $Y^+$, shown in figure 23.10-A. An implementation is [FXT: walsh/reedmuller.h]:

template <typename Type>
void word_reed_muller_dif2(Type *f, ulong ldn)
// Reed-Muller Transform.
// Radix-2 decimation in frequency (DIF) algorithm.
// Self-inverse.
// Type must have the XOR operator.
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong r=0; r<n; r+=m)
        {
            ulong t1 = r;
            ulong t2 = r+mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u;
                f[t2] = u ^ v;
            }
        }
    }
}

As given, the transforms work word-wise. A version for the bit-wise transform is

template <typename Type>
inline void bit_reed_muller(Type *f, ulong ldn)
{
    word_reed_muller_dif2(f, ldn);
    ulong n = 1UL << ldn;
    for (ulong k=0; k<n; ++k)  f[k] = yellow_code(f[k]);
}


The yellow_code() (see section 1.19 on page 49) may also be applied before the main loop. In fact, the
yellow code is the Reed-Muller transform on a binary word.

The other 'color-transforms' of section 1.19 lead to variants of the Reed-Muller transform: the blue code
gives another self-inverse transform, the red code and the green code give transforms R and E so that

$$ R\,R\,R = \mathrm{id}, \qquad R^{-1} = R\,R = E \qquad \text{(23.11-1a)} $$
$$ E\,E\,E = \mathrm{id}, \qquad E^{-1} = E\,E = R \qquad \text{(23.11-1b)} $$
$$ R\,E = E\,R = \mathrm{id} \qquad \text{(23.11-1c)} $$

As can be seen from the matrix relations 1.19-12c ... 1.19-12f on page 55, the four transforms are obtained
by the following replacements:

    Walsh:  f[t1] = u + v;  f[t2] = u - v;
    B:      f[t1] = u ^ v;  f[t2] = v;      (reversed Reed-Muller transform)
    Y:      f[t1] = u;      f[t2] = u ^ v;  (Reed-Muller transform)
    R:      f[t1] = v;      f[t2] = u ^ v;
    E:      f[t1] = u ^ v;  f[t2] = u;

The basis functions of the transforms are shown in figure 23.11-A.
For example, if we make the following changes in the routine walsh_wak_dif2() in the file [FXT:
walsh/walshwak2.h], we obtain a Reed-Muller transform:

    Walsh: f[t1] = u + v;  -->  Reed-Muller: f[t1] = u;
    Walsh: f[t2] = u - v;  -->  Reed-Muller: f[t2] = u ^ v;

For the decimation in time algorithm, make the very same changes in walsh_wak_dit2().
The replacements for the reversed Reed-Muller transform are:

    Walsh: f[t1] = u + v;  -->  reversed Reed-Muller: f[t1] = u ^ v;
    Walsh: f[t2] = u - v;  -->  reversed Reed-Muller: f[t2] = v;
Figure 23.11-A: Basis functions of the length-16 blue, yellow, red, and green transforms. The blue
transform is the reversed Reed-Muller transform, the yellow transform is the Reed-Muller transform.


The symbolic powering idea from section 1.19 on page 49 leads to transforms with the following bases
(using length-8 arrays and the yellow code):

[Eight 8x8 basis patterns, one for each of x=0, x=1, ..., x=7.]

The program [FXT: bits/bitxtransforms-demo.cc] gives the matrices for 64-bit words.
A function that computes the k-th basis function of the transform is [FXT: walsh/reedmuller.h]:

template <typename Type>
inline void reed_muller_basis(Type *f, ulong n, ulong k)
{
    for (ulong i=0; i<n; ++i)
    {
        f[i] = ( (i & k)==k ? +1 : 0 );  // is k a subset of i (as bitset)?
    }
}


Functions that are the word-wise equivalents of the Gray code are given in [FXT: aux1/wordgray.h]:

template <typename Type>
void word_gray(Type *f, ulong n)
{
    for (ulong k=0; k<n-1; ++k)  f[k] ^= f[k+1];
}

and

template <typename Type>
void inverse_word_gray(Type *f, ulong n)
{
    ulong x = 0, k = n;
    while ( k-- )  { x ^= f[k];  f[k] = x; }
}

As one might suspect, these are related to the Reed-Muller transform. Writing Y ('yellow') for the
Reed-Muller transform, g for the word-wise Gray code and $S_k$ for the cyclic shift by k words (word zero is
moved to position k) we have

$$ Y \, S_{+1} \, Y = g \qquad \text{(23.11-2a)} $$
$$ Y \, S_{-1} \, Y = g^{-1} \qquad \text{(23.11-2b)} $$
$$ Y \, S_k \, Y = g^{k} \qquad \text{(23.11-2c)} $$

These are exactly the relations 1.19-10a ... 1.19-10c on page 53 for the bit-wise transforms.


The power of the word-wise Gray code is analogous to the bit-wise version:

template <typename Type>
void word_gray_pow(Type *f, ulong n, ulong x)
{
    for (ulong s=1; s<n; s*=2)
    {
        if ( x & 1 )
        {
            // word_gray ** s:
            for (ulong k=0, j=k+s; j<n; ++k,++j)  f[k] ^= f[j];
        }
        x >>= 1;
    }
}

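Let e be the reversed Gray code operator, then we have for the reversed Reed-Muller transform B:

$$ B \, S_{+1} \, B = e^{-1} \qquad \text{(23.11-3a)} $$
$$ B \, S_{-1} \, B = e \qquad \text{(23.11-3b)} $$
$$ B \, S_k \, B = e^{-k} \qquad \text{(23.11-3c)} $$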
Further, 

ES,R = e (23.11-4a) 

EER = Sy (23.11-4b) 


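The transforms as Kronecker products (all operations are modulo 2):

$$ B_n = \bigotimes_{k=1}^{\log_2(n)} B_2 \quad \text{where} \quad B_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \qquad \text{(23.11-5a)} $$
$$ Y_n = \bigotimes_{k=1}^{\log_2(n)} Y_2 \quad \text{where} \quad Y_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \qquad \text{(23.11-5b)} $$
$$ R_n = \bigotimes_{k=1}^{\log_2(n)} R_2 \quad \text{where} \quad R_2 = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \qquad \text{(23.11-5c)} $$
$$ E_n = \bigotimes_{k=1}^{\log_2(n)} E_2 \quad \text{where} \quad E_2 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \qquad \text{(23.11-5d)} $$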

23.12 The OR-convolution and the AND-convolution 
Let a and b be sequences of length a power of 2. We define the OR-convolution h of a and b as

$$ h_\tau = \sum_{i \vee j = \tau} a_i \, b_j \qquad \text{(23.12-1)} $$

where $\vee$ denotes bit-wise OR. The symbolic table for the OR-convolution is shown in figure 23.12-A (see
figure 22.1-A for an explanation of the scheme). The OR-convolution can be computed via

template <typename Type>
inline void slow_or_convolution(const Type *f, const Type *g, ulong ldn, Type *h)
// Compute the OR-convolution h[] of f[] and g[]:
//   h[k] = sum( i | j == k, f[i]*g[j] )
// Result written to h[].
{
    const ulong n = 1UL << ldn;
    for (ulong j=0; j<n; ++j)  h[j] = 0;
    for (ulong i=0; i<n; ++i)
        for (ulong j=0; j<n; ++j)
            h[i|j] += f[i] * g[j];
}
Figure 23.12-A: Semi-symbolic scheme for the OR-convolution (top) and the AND-convolution (bottom)
of two length-16 sequences.


The following relation is the key to the fast computation of the OR-convolution:

$$ h = Y^-\left[ \, Y^+[a] \cdot Y^+[b] \, \right] \qquad \text{(23.12-2)} $$

Here $Y^+$ and $Y^-$ denote the arithmetic transforms given in section 23.10 on page 483. An implementation
is [FXT: walsh/or-convolution.h]:


(Type * restrict f, Type * restrict g, ulong ldn) 
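Relation 23.12-2 can be tried out with the following stand-alone sketch, which compares the transform-based computation with the definition. It is not the FXT implementation: the names `arith_plus()`, `arith_minus()`, `or_convolution_slow()` and `or_convolution_fast()` are hypothetical and chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Y(+): f[t1] = u;  f[t2] = u + v;   (length must be a power of 2)
static void arith_plus(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
                f[j+mh] += f[j];
}

// Y(-): f[t1] = u;  f[t2] = v - u;
static void arith_minus(std::vector<long> &f)
{
    const std::size_t n = f.size();
    for (std::size_t mh = 1; mh < n; mh *= 2)
        for (std::size_t r = 0; r < n; r += 2*mh)
            for (std::size_t j = r; j < r + mh; ++j)
                f[j+mh] -= f[j];
}

// By definition: h[i | j] += a[i] * b[j].
static std::vector<long> or_convolution_slow(const std::vector<long> &a,
                                             const std::vector<long> &b)
{
    const std::size_t n = a.size();
    std::vector<long> h(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            h[i | j] += a[i] * b[j];
    return h;
}

// h = Y(-)[ Y(+)[a] * Y(+)[b] ], element-wise product in the middle.
static std::vector<long> or_convolution_fast(std::vector<long> a, std::vector<long> b)
{
    arith_plus(a);
    arith_plus(b);
    for (std::size_t k = 0; k < b.size(); ++k)  b[k] *= a[k];
    arith_minus(b);
    return b;
}
```

No normalization is needed here, because $Y^-$ is the exact inverse of $Y^+$.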
Define the AND-convolution h of two sequences a and b as

$$ h_\tau = \sum_{i \wedge j = \tau} a_i \, b_j \qquad \text{(23.12-3)} $$

where $\wedge$ denotes the bit-wise AND. The symbolic scheme is shown in figure 23.12-A. The AND-convolution
can be computed as

template <typename Type>
inline void slow_and_convolution(const Type *f, const Type *g, ulong ldn, Type *h)
// Compute the AND-convolution h[] of f[] and g[]:
//   h[k] = sum( i & j == k, f[i]*g[j] )
// Result written to h[].
{
    const ulong n = 1UL << ldn;
    for (ulong j=0; j<n; ++j)  h[j] = 0;
    for (ulong i=0; i<n; ++i)
        for (ulong j=0; j<n; ++j)
            h[i&j] += f[i] * g[j];
}


The key to fast computation is the following relation:

$$ h = B^-\left[ \, B^+[a] \cdot B^+[b] \, \right] \qquad \text{(23.12-4)} $$

Here $B^+$ and $B^-$ denote the reversed arithmetic transforms. The implementation of the AND-convolution
is [FXT: walsh/and-convolution.h]:

template <typename Type>
inline void and_convolution(Type * restrict f, Type * restrict g, ulong ldn)
{
    rev_arith_transform_plus(f, ldn);
    rev_arith_transform_plus(g, ldn);
    const ulong n = (1UL<<ldn);
    for (ulong k=0; k<n; ++k)  g[k] *= f[k];
    rev_arith_transform_minus(g, ldn);
}


23.13 The MAX-convolution ‡
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Figure 23.13-A: Semi-symbolic scheme for the MAX-convolution of two length-8 sequences. 


Let a and b be sequences of length n (not necessarily a power of 2). We define the MAX-convolution h
of a and b as

$$ h_\tau = \sum_{\max(i,j) = \tau} a_i \, b_j \qquad \text{(23.13-1)} $$

The computation by definition involves $O(n^2)$ operations [FXT: walsh/max-convolution.h]:

template <typename Type>
inline void slow_max_convolution(const Type *f, const Type *g, ulong n, Type *h)
// Compute the MAX-convolution h[] of f[] and g[]:
//   h[k] = sum( max(i,j) == k, f[i]*g[j] )
// Result written to h[].
{
    for (ulong j=0; j<n; ++j)  h[j] = 0;
    for (ulong i=0; i<n; ++i)
        for (ulong j=0; j<n; ++j)
            h[ max2(i,j) ] += f[i] * g[j];
}


Duraid Madina [priv. comm.] asks whether the MAX-convolution can be computed faster than $O(n^2)$.
Indeed, the structure (see figure 23.13-A) is so simple that it can be computed in linear time:

template <typename Type>
inline void max_convolution(const Type *f, const Type *g, ulong n, Type *h)
{
    Type sf=0, sg=0;  // cumulative sums
    for (ulong k=0; k<n; ++k)
    {
        h[k] = f[k]*g[k] + sf*g[k] + sg*f[k];
        sf += f[k];
        sg += g[k];
    }
}
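The linear-time routine works because the pairs (i, j) with max(i,j) = k are exactly (k, k), the pairs (i, k) with i < k, and the pairs (k, j) with j < k. A stand-alone sketch comparing both computations follows; the function names `max_convolution_slow()` and `max_convolution_linear()` are hypothetical and the length used in the check is deliberately not a power of 2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// By definition: h[max(i,j)] += f[i] * g[j], quadratic cost.
static std::vector<long> max_convolution_slow(const std::vector<long> &f,
                                              const std::vector<long> &g)
{
    const std::size_t n = f.size();
    std::vector<long> h(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            h[ std::max(i, j) ] += f[i] * g[j];
    return h;
}

// Linear time, using cumulative sums of the elements before index k.
static std::vector<long> max_convolution_linear(const std::vector<long> &f,
                                                const std::vector<long> &g)
{
    const std::size_t n = f.size();
    std::vector<long> h(n, 0);
    long sf = 0, sg = 0;  // sums of f[0..k-1] and g[0..k-1]
    for (std::size_t k = 0; k < n; ++k)
    {
        h[k] = f[k]*g[k] + sf*g[k] + sg*f[k];
        sf += f[k];
        sg += g[k];
    }
    return h;
}
```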


23.14 Weighted arithmetic transform and subset convolution 


23.14.1 The weighted arithmetic transform 
Figure 23.14-A: Basis for the weighted arithmetic transform $Y^+$, dots denote zeros.


We define the weighted arithmetic transform $Y^+$ and its inverse $Y^-$ as

$$ Y^+_n = \bigotimes_{k=1}^{\log_2(n)} Y^+_2 \quad \text{where} \quad Y^+_2 = \begin{bmatrix} +1 & 0 \\ +1 & +w \end{bmatrix} \qquad \text{(23.14-1a)} $$
$$ Y^-_n = \bigotimes_{k=1}^{\log_2(n)} Y^-_2 \quad \text{where} \quad Y^-_2 = \begin{bmatrix} +1 & 0 \\ -1/w & +1/w \end{bmatrix} \qquad \text{(23.14-1b)} $$

The k-th element of the weighted arithmetic transform $Y^+$ is

$$ Y^+[a]_k = \sum_{i \subseteq k} w^{c(i)} \, a_i \qquad \text{(23.14-2a)} $$

where $i \subseteq k$ means that the bits of i are a subset of the bits of k, and c(i) is the number of ones in the
binary expansion of i. For the transform $Y^-$ we have

$$ Y^-[a]_k = w^{-c(k)} \, (-1)^{p(k)} \sum_{i \subseteq k} (-1)^{p(i)} \, a_i = w^{-c(k)} \sum_{i \subseteq k} (-1)^{p(k)+p(i)} \, a_i \qquad \text{(23.14-2b)} $$

where p(x) is the parity of x. The basis functions are shown in figure 23.14-A. Note that the power of w
is identical for each column. The pattern is

    0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 ...

This is the sequence of the number of ones in the binary expansions of the natural numbers, entry A000120
in [312]. We can compute the weighted transform of f by first multiplying the sequence

    w^0 w^1 w^1 w^2 w^1 w^2 w^2 w^3 w^1 w^2 w^2 w^3 w^2 w^3 w^3 w^4 ...

element-wise to f and then use the (unweighted) transform $Y^+$ [FXT: walsh/weighted-arithtransform.h]:

template <typename Type>
void arith_transform_plus(Type *f, ulong ldn, Type w)
// Weighted arithmetic transform (positive sign).
{
    if ( w!=(Type)1 )  bit_count_weight(f, ldn, w);
    arith_transform_plus(f, ldn);
}


The routine for the multiplications with powers of w is [FXT: walsh/bitcount-weight.h]:

template <typename Type>
void bit_count_weight(Type *f, ulong ldn, Type w)
// Multiply f[i] by w**bitcount(i).
{
    ALLOCA(Type, pw, ldn+1);  // powers of w
    pw[0] = (Type)1;
    for (ulong j=1; j<=ldn; ++j)  pw[j] = w * pw[j-1];
    const ulong n = (1UL<<ldn);
    for (ulong j=1; j<n; ++j)  f[j] *= pw[ bit_count(j) ];
}


To compute the inverse transform, use

template <typename Type>
void arith_transform_minus(Type *f, ulong ldn, Type w)
// Weighted arithmetic transform (negative sign).
// Inverse of (weighted) arith_transform_plus().
{
    arith_transform_minus(f, ldn);
    if ( w!=(Type)1 )  bit_count_weight(f, ldn, 1.0/w);
}


23.14.2 Subset convolution 


We want to compute the subset convolution s of the sequences a and b, defined as 


s = 3 ab; (23.14-3) 


iVj=r, iAj=0 


The definition is similar to the OR-convolution, but the condition ¿A j = 0 (no intersecting subsets) 
makes matters more complicated. Figure [23.14-B] shows the symbolic scheme, note that many products 
a;b; do not appear at all in the subset convolution. The total number of products a;b; is N 3 for N a 
power of 2. It may seem that computing fewer products (than N4, as with the OR-convolution) would 
allow for a method even cheaper than O (N log N), but no such scheme is known. We develop a method 
that is O (N (log N)?). 


Define the weighted OR-convolution h(w) of a and b as 


hw), = Y wl ab; (23.14-4) 
IN J=T 
The symbolic table for the convolution with w = —1 is shown in fi 23.14-C} The positive entries appear 


where the basis of the Walsh transform is positive, see figure 23.1-ATon page 
weighted OR-convolution by definition [FXT: |walsh/weighted-or-convolution.h/: 
template <typename Type> 


inline void slow_weighted_or_convolution(const Type *f, const Type *g, ulong ldn, 


We can compute the 
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[O 1 3 4 5 6 7 8 910 11 12 13 14 15 ] 
[1 5 7.9 .41 .13 .15 .] 
[2 3 6 7 . .1011 . . 14 15 ] 
[3 7 11 . . . 15 .] 
[4 5.6 7 . . . . 12 13 14 15 .] 
[5 13 . 15 .] 
C6 T^ ce stt as a tAds .] 
[ 7 15 .] 
[8 9 10 11 12 13 14 15 .] 
[ 9 11 . 13 . 15 "A 
[10 11 . . 14 15 .] 
[11 .. 2 . 15 .] 
[12 13 14 15 .] 
[13 . 15 .] 
[14 15 .] 
[15 .] 


Figure 23.14-B: Semi-symbolic scheme for the subset convolution. Dots denote unused products. 


[Two 16 x 16 tables, rows i and columns j running from 0 to 15. The entry at (i, j) is the index
i OR j of the result that receives the product a_i b_j. It appears in the table of positive entries
if the bit-count of (i AND j) is even, and in the table of negative entries if it is odd.]


Figure 23.14-C: Semi-symbolic scheme for the weighted OR-convolution with w = —1, separated into 
positive (top) and negative (bottom) entries. 
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Type *h, Type w) 
// Compute the weighted OR-convolution h[] of f[] and g[]:
//   h[k] = sum( i | j == k,  f[i]*g[j] * (w)**bit_count(i&j) )
// Result written to h[].
{
    ALLOCA(Type, pw, ldn+1);  // powers of w
    pw[0] = (Type)1;
    for (ulong j=1; j<=ldn; ++j)  pw[j] = w * pw[j-1];
    const ulong n = 1UL << ldn;
    for (ulong j=0; j<n; ++j)  h[j] = 0;
    for (ulong i=0; i<n; ++i)
        for (ulong j=0; j<n; ++j)
            h[i|j] += f[i] * g[j] * pw[ bit_count( i & j ) ];
}


A fast algorithm is based on the relation 

    h^{(w)} = Y^{-} \left[ \, Y^{+}[a] \cdot Y^{+}[b] \, \right]        (23.14-5)


Here is the implementation: 


template <typename Type>
inline void weighted_or_convolution(Type * restrict f, Type * restrict g, ulong ldn, Type w)
{
    arith_transform_plus(f, ldn, w);
    arith_transform_plus(g, ldn, w);
    const ulong n = (1UL<<ldn);
    for (ulong k=0; k<n; ++k)  g[k] *= f[k];
    arith_transform_minus(g, ldn, w);
}
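
[ . . . . . . . . . . . . . . . . ]
[ . 1 . 1 . 1 . 1 . 1 . 1 . 1 . 1 ]
[ . . 1 1 . . 1 1 . . 1 1 . . 1 1 ]
[ . 1 1 2 . 1 1 2 . 1 1 2 . 1 1 2 ]
[ . . . . 1 1 1 1 . . . . 1 1 1 1 ]
[ . 1 . 1 1 2 1 2 . 1 . 1 1 2 1 2 ]
[ . . 1 1 1 1 2 2 . . 1 1 1 1 2 2 ]
[ . 1 1 2 1 2 2 3 . 1 1 2 1 2 2 3 ]
[ . . . . . . . . 1 1 1 1 1 1 1 1 ]
[ . 1 . 1 . 1 . 1 1 2 1 2 1 2 1 2 ]
[ . . 1 1 . . 1 1 1 1 2 2 1 1 2 2 ]
[ . 1 1 2 . 1 1 2 1 2 2 3 1 2 2 3 ]
[ . . . . 1 1 1 1 1 1 1 1 2 2 2 2 ]
[ . 1 . 1 1 2 1 2 1 2 1 2 2 3 2 3 ]
[ . . 1 1 1 1 2 2 1 1 2 2 2 2 3 3 ]
[ . 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 ]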


Figure 23.14-D: Matrix M where M_{i,j} = c(i AND j), the number of bits in the intersection of the bitsets i
and j. Dots denote zeros.


The weighted OR-convolution keeps track of the number of bits that overlap. We quantify the overlap by
the bit-count of i AND j, see figure 23.14-D. Only the zero entries give contributions to the subset convolution.


Now set w = exp(2 pi i / L) (a primitive L-th root of unity) where L = 1 + log_2(N) and N is the length of
the transforms. We compute the subset convolution s as

    s = \frac{1}{L} \sum_{j=0}^{L-1} h^{(w^j)}        (23.14-6)
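
Relation 23.14-6 can be checked numerically on a small example. The following is a minimal sketch, not the FXT implementation: both sides are computed by definition (so the cost is quadratic per root of unity), and the function names, the use of std::vector, and std::complex are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> Complex;

// Number of set bits in x.
static unsigned bit_count(std::size_t x)
{
    unsigned c = 0;
    while ( x )  { x &= x - 1;  ++c; }
    return c;
}

// Subset convolution by definition (hypothetical helper):
// h[k] = sum of f[i]*g[j] over disjoint i, j with (i|j) == k.
std::vector<double> subset_convolution_by_definition(const std::vector<double> &f,
                                                     const std::vector<double> &g)
{
    const std::size_t n = f.size();
    std::vector<double> h(n, 0.0);
    for (std::size_t i=0; i<n; ++i)
        for (std::size_t j=0; j<n; ++j)
            if ( (i & j) == 0 )  h[i|j] += f[i] * g[j];
    return h;
}

// Relation 23.14-6: average the weighted OR-convolutions (each computed
// by definition) over the powers of w = exp(2*pi*i/L), L = 1 + ldn.
std::vector<double> subset_convolution_via_roots(const std::vector<double> &f,
                                                 const std::vector<double> &g,
                                                 unsigned ldn)
{
    const std::size_t n = std::size_t(1) << ldn;
    const unsigned L = ldn + 1;
    const double pi = std::acos(-1.0);
    std::vector<Complex> acc(n, Complex(0.0, 0.0));
    for (unsigned r=0; r<L; ++r)
    {
        // pw[c] = (w^r)^c for every possible overlap c = 0, ..., ldn
        std::vector<Complex> pw(L);
        for (unsigned c=0; c<L; ++c)  pw[c] = std::polar(1.0, 2.0*pi*r*c/L);
        for (std::size_t i=0; i<n; ++i)
            for (std::size_t j=0; j<n; ++j)
                acc[i|j] += (f[i] * g[j]) * pw[ bit_count(i & j) ];
    }
    std::vector<double> s(n);
    for (std::size_t k=0; k<n; ++k)  s[k] = acc[k].real() / L;
    return s;
}
```

The products with overlap c > 0 cancel because the sum of (w^c)^r over r = 0, ..., L-1 vanishes for 0 < c < L, and the overlap never exceeds log_2(N) = L - 1.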


The implementation uses an unweighted OR-convolution for the case w^0 = 1 [FXT: walsh/subset-convolution.h]:

template <typename Type>
inline void subset_convolution(Type *f, Type *g, ulong ldn)
// Compute the subset convolution h[] of f[] and g[]:
//   h[k] = sum( j subset k,  f[j]*g[k-j] )
// Type must allow conversion to and from type Complex.
// Result written to g[].
{
    const ulong n = 1UL << ldn;

    Complex *fc, *gc, *hc;
    fc = new Complex[n];
    gc = new Complex[n];
    hc = new Complex[n];

    // w^0:
    copy_cast(f, fc, n);
    copy_cast(g, gc, n);
    or_convolution(fc, gc, ldn);
    acopy(gc, hc, n);

    // w^1, w^2, ... , w^(L-1):
    const ulong L = ldn + 1;
    const Complex w = SinCos( 2*M_PI/(double)L );
    Complex wp = 1.0;  // powers of w
    for (ulong j=1; j<L; ++j)
    {
        copy_cast(f, fc, n);
        copy_cast(g, gc, n);
        wp *= w;
        weighted_or_convolution(fc, gc, ldn, wp);
        for (ulong k=0; k<n; ++k)  hc[k] += gc[k];
    }

    const double x = 1.0/(double)L;
    for (ulong k=0; k<n; ++k)  hc[k] *= x;
    for (ulong k=0; k<n; ++k)  g[k] = (Type)hc[k].real();

    delete [] fc;
    delete [] gc;
    delete [] hc;
}


Relation 23.14-6 is the special case e = 0 of

    s^{(e)} = \frac{1}{L} \sum_{j=0}^{L-1} w^{-e j} \, h^{(w^j)}        (23.14-7)

where s^{(e)} is the convolution over subsets that share e elements:

    s^{(e)}_\tau = \sum_{i \vee j = \tau, \; c(i \wedge j) = e} a_i \, b_j        (23.14-8)


There are several ways to avoid usage of the complex domain. Relation 23.14-7 is essentially a Fourier
sum and we could recast the algorithm in terms of a Hartley transform. This avoids the complex domain
but still uses real (inexact) arithmetic.


As L is small we can explicitly represent any number y = \sum_{j=0}^{L-1} y_j w^j as a polynomial modulo x^L - 1. The
additions are element-wise, and a multiplication by w is a cyclic shift. This avoids inexact computations
but needs space O(n log(n)). Another approach (suggested in [53]) is to compute the L transforms of
the subsequences of a and b where the bit-count is constant: let a^{(e)} be the sequence defined by

    a^{(e)}_i = \begin{cases} a_i & \text{if } c(i) = e \\ 0 & \text{otherwise} \end{cases}        (23.14-9)


then 

    s^{(e)} = Y^{-} \left[ \, \sum_{j=0}^{e} Y^{+}\left[a^{(j)}\right] \cdot Y^{+}\left[b^{(e-j)}\right] \, \right]        (23.14-10)
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Chapter 24 


The Haar transform 


Haar transforms are invertible transforms that do not involve trigonometric factors. We present several
variants of the transform whose computation involves just O(n) operations. Haar transforms can be used
as building blocks of the Walsh transform. We describe the prefix transform and its convolution, and give
two non-standard splitting schemes for Haar transforms, based on the Fibonacci and Mersenne numbers.


24.1 The ‘standard’ Haar transform 


 0: [++++++++++++++++++++++++++++++++] 1/sqrt(32)
 1: [++++++++++++++++----------------] 1/sqrt(32)
 2: [++++++++--------                ] 1/sqrt(16)
 3: [                ++++++++--------] 1/sqrt(16)
 4: [++++----                        ] 1/sqrt(8)
 5: [        ++++----                ] 1/sqrt(8)
 6: [                ++++----        ] 1/sqrt(8)
 7: [                        ++++----] 1/sqrt(8)
 8: [++--                            ] 1/sqrt(4)
 9: [    ++--                        ] 1/sqrt(4)
10: [        ++--                    ] 1/sqrt(4)
11: [            ++--                ] 1/sqrt(4)
12: [                ++--            ] 1/sqrt(4)
13: [                    ++--        ] 1/sqrt(4)
14: [                        ++--    ] 1/sqrt(4)
15: [                            ++--] 1/sqrt(4)
16: [+-                              ] 1/sqrt(2)
17: [  +-                            ] 1/sqrt(2)
18: [    +-                          ] 1/sqrt(2)
19: [      +-                        ] 1/sqrt(2)
20: [        +-                      ] 1/sqrt(2)
21: [          +-                    ] 1/sqrt(2)
22: [            +-                  ] 1/sqrt(2)
23: [              +-                ] 1/sqrt(2)
24: [                +-              ] 1/sqrt(2)
25: [                  +-            ] 1/sqrt(2)
26: [                    +-          ] 1/sqrt(2)
27: [                      +-        ] 1/sqrt(2)
28: [                        +-      ] 1/sqrt(2)
29: [                          +-    ] 1/sqrt(2)
30: [                            +-  ] 1/sqrt(2)
31: [                              +-] 1/sqrt(2)


Figure 24.1-A: Basis functions for the Haar transform. Only the signs of the nonzero entries are shown. 
The absolute value of the nonzero entries in each row is given at the right. The norm of each row is one. 


The Haar transform of a length-n sequence f consists of log_2(n) steps where the sums and differences of
adjacent pairs of elements f_{2j}, f_{2j+1} are computed. The sums are then written to the lower half of the
array f, the differences to the upper half. Ignoring the order (and normalization), each step corresponds
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to a matrix multiplication: 


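    \begin{bmatrix}
    +1 & +1 &    &    &    &    &    &    \\
    +1 & -1 &    &    &    &    &    &    \\
       &    & +1 & +1 &    &    &    &    \\
       &    & +1 & -1 &    &    &    &    \\
       &    &    &    & +1 & +1 &    &    \\
       &    &    &    & +1 & -1 &    &    \\
       &    &    &    &    &    & +1 & +1 \\
       &    &    &    &    &    & +1 & -1
    \end{bmatrix}
    \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \\ f_7 \end{bmatrix}        (24.1-1)
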
The step is applied to the full array, then to the lower half, the lower quarter, ..., the lower four elements,
the lowest pair (the array length n must be a power of 2). The computational cost of the transform is
proportional to n + n/2 + n/4 + ... + 4 + 2, which is O(n). The basis functions for the Haar transform
are shown in figure 24.1-A.


The following implementation involves 2n multiplications by 1/sqrt(2) which make the transform orthogonal,
corresponding to a scalar factor of 1/sqrt(2) in relation 24.1-1:


template <typename Type>
void haar(Type *f, ulong ldn)
{
    ulong n = (1UL<<ldn);
    const Type s2 = sqrt(0.5);  // normalization factor
    Type *g = new Type[n];  // scratch space
    for (ulong m=n; m>1; m>>=1)  // n, n/2, n/4, n/8, ..., 4, 2
    {
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)  // sums and differences of adjacent pairs
        {
            Type x = f[j];
            Type y = f[j+1];
            g[k]    = (x + y) * s2;  // sums to lower half
            g[mh+k] = (x - y) * s2;  // differences to upper half
        }
        acopy(g, f, m);
    }
    delete [] g;
}


We reduce the number of multiplications to n by delaying the multiplications [FXT: haar/haar.h]:

template <typename Type>
void haar(Type *f, ulong ldn, Type *ws=0)
{
    ulong n = (1UL<<ldn);
    Type s2 = sqrt(0.5);
    Type v = 1.0;
    Type *g = ws;
    if ( !ws )  g = new Type[n];
    for (ulong m=n; m>1; m>>=1)
    {
        v *= s2;
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)
        {
            Type x = f[j];
            Type y = f[j+1];
            g[k]    = x + y;
            g[mh+k] = (x - y) * v;
        }
        acopy(g, f, m);
    }
    f[0] *= v;  // v == 1.0/sqrt(n);
    if ( !ws )  delete [] g;
}




The temporary workspace can be supplied by the caller. 


The inverse Haar transform is computed by using the inverse steps in reversed order: 


template <typename Type>
void inverse_haar(Type *f, ulong ldn, Type *ws=0)
{
    ulong n = (1UL<<ldn);
    Type s2 = sqrt(2.0);
    Type v = 1.0/sqrt(n);
    Type *g = ws;
    if ( !ws )  g = new Type[n];
    f[0] *= v;
    for (ulong m=2; m<=n; m<<=1)
    {
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)
        {
            Type x = f[k];
            Type y = f[mh+k] * v;
            g[j]   = x + y;
            g[j+1] = x - y;
        }
        acopy(g, f, m);
        v *= s2;
    }
    if ( !ws )  delete [] g;
}
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A generalization of the steps used in the Haar transform leads to the wavelet transforms treated in 


chapter 27 on page 543.


24.2 In-place Haar transform 


The 'standard' Haar transform routines are not in-place: they use temporary storage. A rather simple
reordering of the basis functions, however, allows for an in-place algorithm [FXT: haar/haar.h]:


template <typename Type>
void haar_inplace(Type *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    Type s2 = sqrt(0.5);
    Type v = 1.0;
    for (ulong js=2; js<=n; js<<=1)
    {
        v *= s2;
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            Type x = f[j];
            Type y = f[t];
            f[j] = x + y;
            f[t] = (x - y) * v;
        }
    }
    f[0] *= v;  // v==1.0/sqrt(n);
}


The basis functions of the transform are shown in figure 24.2-A. The routine for the inverse transform is

template <typename Type>
void inverse_haar_inplace(Type *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    Type s2 = sqrt(2.0);
    Type v = 1.0/sqrt(n);
    f[0] *= v;
    for (ulong js=n; js>=2; js>>=1)
    {
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            Type x = f[j];
            Type y = f[t] * v;
            f[j] = x + y;
            f[t] = x - y;
        }
        v *= s2;
    }
}

 0: [++++++++++++++++++++++++++++++++] 1/sqrt(32)
 1: [+-                              ] 1/sqrt(2)
 2: [++--                            ] 1/sqrt(4)
 3: [  +-                            ] 1/sqrt(2)
 4: [++++----                        ] 1/sqrt(8)
 5: [    +-                          ] 1/sqrt(2)
 6: [    ++--                        ] 1/sqrt(4)
 7: [      +-                        ] 1/sqrt(2)
 8: [++++++++--------                ] 1/sqrt(16)
 9: [        +-                      ] 1/sqrt(2)
10: [        ++--                    ] 1/sqrt(4)
11: [          +-                    ] 1/sqrt(2)
12: [        ++++----                ] 1/sqrt(8)
13: [            +-                  ] 1/sqrt(2)
14: [            ++--                ] 1/sqrt(4)
15: [              +-                ] 1/sqrt(2)
16: [++++++++++++++++----------------] 1/sqrt(32)
17: [                +-              ] 1/sqrt(2)
18: [                ++--            ] 1/sqrt(4)
19: [                  +-            ] 1/sqrt(2)
20: [                ++++----        ] 1/sqrt(8)
21: [                    +-          ] 1/sqrt(2)
22: [                    ++--        ] 1/sqrt(4)
23: [                      +-        ] 1/sqrt(2)
24: [                ++++++++--------] 1/sqrt(16)
25: [                        +-      ] 1/sqrt(2)
26: [                        ++--    ] 1/sqrt(4)
27: [                          +-    ] 1/sqrt(2)
28: [                        ++++----] 1/sqrt(8)
29: [                            +-  ] 1/sqrt(2)
30: [                            ++--] 1/sqrt(4)
31: [                              +-] 1/sqrt(2)

Figure 24.2-A: Haar basis functions, in-place order. Only the signs of the nonzero entries are shown.
The absolute value of the nonzero entries in each row is given at the right. The norm of each row is one.

The in-place Haar transform H_i is related to the 'usual' Haar transform H by a permutation P_H via the
relations

    H      = P_H \, H_i                    (24.2-1a)
    H^{-1} = H_i^{-1} \, P_H^{-1}          (24.2-1b)

The permutation P_H can be programmed as

template <typename Type>
void haar_permute(Type *f, ulong n)
{
    revbin_permute(f, n);
    for (ulong m=4; m<=n/2; m*=2)  revbin_permute(f+m, m);
}

The revbin permutations in the loop do not overlap, so the routine for the inverse Haar permutation is
obtained by simply swapping the loop with the full-length revbin permutation [FXT: perm/haarpermute.h]:

template <typename Type>
void inverse_haar_permute(Type *f, ulong n)
{
    for (ulong m=4; m<=n/2; m*=2)  revbin_permute(f+m, m);
    revbin_permute(f, n);
}


[Basis functions for rows 0 to 31. The absolute values of the nonzero entries are 1/sqrt(32) for rows 0 and 1,
1/sqrt(16) for rows 2 and 3, 1/sqrt(8) for rows 4 to 7, 1/sqrt(4) for rows 8 to 15, and 1/sqrt(2) for rows 16 to 31.]


Figure 24.2-B: Basis functions of the in-place order Haar transform followed by a revbin permutation. 


In this ordering those basis functions which are identical up to a shift appear consecutively. 


Relation 24.2-1a tells us that haar() is equivalent to the sequence of statements

    haar_inplace();
    haar_permute();

and, by relation 24.2-1b, inverse_haar() is equivalent to

    inverse_haar_permute();
    inverse_haar_inplace();
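
The equivalence stated by relation 24.2-1a can be checked with a small self-contained program. The sketch below re-implements the routines of this section on std::vector<double>; the helper revbin() and the simple revbin_permute() are assumptions standing in for the FXT routines of the same purpose, not the library code.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

typedef unsigned long ulong;

// Reverse the lowest ldn bits of x (hypothetical helper).
static ulong revbin(ulong x, ulong ldn)
{
    ulong r = 0;
    for (ulong b=0; b<ldn; ++b)  { r = (r<<1) | (x&1);  x >>= 1; }
    return r;
}

// Swap f[k] with f[revbin(k)]; n must be a power of 2.
static void revbin_permute(double *f, ulong n)
{
    ulong ldn = 0;
    while ( (1UL<<ldn) < n )  ++ldn;
    for (ulong k=0; k<n; ++k)
    {
        const ulong r = revbin(k, ldn);
        if ( r > k )  std::swap(f[k], f[r]);
    }
}

// Normalized Haar transform with scratch space (delayed multiplications).
void haar(std::vector<double> &f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    const double s2 = std::sqrt(0.5);
    double v = 1.0;
    std::vector<double> g(n);
    for (ulong m=n; m>1; m>>=1)
    {
        v *= s2;
        const ulong mh = m >> 1;
        for (ulong j=0, k=0; j<m; j+=2, ++k)
        {
            const double x = f[j],  y = f[j+1];
            g[k]    = x + y;        // sums to lower half
            g[mh+k] = (x - y) * v;  // differences to upper half
        }
        for (ulong k=0; k<m; ++k)  f[k] = g[k];
    }
    f[0] *= v;
}

// Normalized in-place Haar transform.
void haar_inplace(std::vector<double> &f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    const double s2 = std::sqrt(0.5);
    double v = 1.0;
    for (ulong js=2; js<=n; js<<=1)
    {
        v *= s2;
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            const double x = f[j],  y = f[t];
            f[j] = x + y;
            f[t] = (x - y) * v;
        }
    }
    f[0] *= v;
}

// The permutation P_H: full revbin permutation, then one per block [m, 2m).
void haar_permute(std::vector<double> &f)
{
    const ulong n = f.size();
    revbin_permute(f.data(), n);
    for (ulong m=4; m<=n/2; m*=2)  revbin_permute(f.data()+m, m);
}
```

Applying haar() to one copy of a length-16 array and haar_inplace() followed by haar_permute() to another should give the same values up to rounding.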


24.3 Non-normalized Haar transforms 


Versions of the Haar transform without normalization are given in [FXT: haar/haarnn.h]. The basis
functions are the same as for the normalized versions, only the absolute values of the nonzero entries are
different.

template <typename Type>
void haar_nn(Type *f, ulong ldn, Type *ws=0)
{
    ulong n = (1UL<<ldn);
    Type *g = ws;
    if ( !ws )  g = new Type[n];
    for (ulong m=n; m>1; m>>=1)
    {
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)
        {
            Type x = f[j];
            Type y = f[j+1];
            g[k]    = x + y;
            g[mh+k] = x - y;
        }
        acopy(g, f, m);
    }
    if ( !ws )  delete [] g;
}


The inverse is 


template <typename Type>
void inverse_haar_nn(Type *f, ulong ldn, Type *ws=0)
{
    ulong n = (1UL<<ldn);
    Type s2 = 2.0;
    Type v = 1.0/n;
    Type *g = ws;
    if ( !ws )  g = new Type[n];
    f[0] *= v;
    for (ulong m=2; m<=n; m<<=1)
    {
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)
        {
            Type x = f[k];
            Type y = f[mh+k] * v;
            g[j]   = x + y;
            g[j+1] = x - y;
        }
        acopy(g, f, m);
        v *= s2;
    }
    if ( !ws )  delete [] g;
}

An unnormalized transform that works in-place is 
template <typename Type>
void haar_inplace_nn(Type *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    for (ulong js=2; js<=n; js<<=1)
    {
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            Type x = f[j];
            Type y = f[t];
            f[j] = x + y;
            f[t] = x - y;
        }
    }
}


The inverse routine is 


template <typename Type>
void inverse_haar_inplace_nn(Type *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    Type s2 = 2.0;
    Type v = 1.0/n;
    f[0] *= v;
    for (ulong js=n; js>=2; js>>=1)
    {
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            Type x = f[j];
            Type y = f[t] * v;
            f[j] = x + y;
            f[t] = x - y;
        }
        v *= s2;
    }
}


The sequence of statements

    haar_inplace_nn();
    haar_permute();

is equivalent to haar_nn(). The sequence

    inverse_haar_permute();
    inverse_haar_inplace_nn();

is equivalent to inverse_haar_nn().


24.4 Transposed Haar transforms ‡
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Figure 24.4-A: Basis functions for the transposed Haar transform. Only the signs of the basis functions 
are shown. At the blank entries the functions are zero. 


Figure 24.4-A shows the basis functions of the transposed Haar transform. The following routine does
an unnormalized transposed Haar transform. The result is, up to normalization, the same as with inverse_haar().
The implementation uses a temporary array [FXT: haar/transposedhaarnn.h]:


template <typename Type>
void transposed_haar_nn(Type *f, ulong ldn, Type *ws=0)
{
    ulong n = (1UL<<ldn);
    Type *g = ws;
    if ( !ws )  g = new Type[n];
    for (ulong m=2; m<=n; m<<=1)
    {
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)
        {
            Type x = f[k];
            Type y = f[mh+k];
            g[j]   = x + y;
            g[j+1] = x - y;
        }
        acopy(g, f, m);
    }
    if ( !ws )  delete [] g;
}


The inverse transform is 


template <typename Type>
void inverse_transposed_haar_nn(Type *f, ulong ldn, Type *ws=0)
{
    ulong n = (1UL<<ldn);
    Type *g = ws;
    if ( !ws )  g = new Type[n];
    for (ulong m=n; m>1; m>>=1)
    {
        ulong mh = (m>>1);
        for (ulong j=0, k=0; j<m; j+=2, k++)
        {
            Type x = f[j] * 0.5;
            Type y = f[j+1] * 0.5;
            g[k]    = x + y;
            g[mh+k] = x - y;
        }
        acopy(g, f, m);
    }
    if ( !ws )  delete [] g;
}
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Figure 24.4-B: Basis functions for the transposed in-place Haar transform. Only the signs of the basis 
functions are shown. At the blank entries the functions are zero. 
The following routine does not use a temporary array: 


template <typename Type>
void transposed_haar_inplace_nn(Type *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    for (ulong js=n; js>=2; js>>=1)
    {
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            Type x = f[j];
            Type y = f[t];
            f[j] = x + y;
            f[t] = x - y;
        }
    }
}


The sequence of statements 


inverse_haar_permute(); 
transposed_haar_inplace_nn(); 


is equivalent to transposed_haar_nn(). The routine for the inverse transform is 


template <typename Type>
void inverse_transposed_haar_inplace_nn(Type *f, ulong ldn)
{
    ulong n = 1UL<<ldn;
    for (ulong js=2; js<=n; js<<=1)
    {
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            Type x = f[j] * 0.5;
            Type y = f[t] * 0.5;
            f[j] = x + y;
            f[t] = x - y;
        }
    }
}


24.5 The reversed Haar transform ‡


We give two more variants of the Haar transform, which we call the reversed Haar transform and the
transposed reversed Haar transform. The basis functions of the reversed Haar transform are shown in
figure 24.5-A.


Let H_{ni} denote the non-normalized in-place Haar transform (haar_inplace_nn), H_{tni} the transposed
non-normalized in-place Haar transform (transposed_haar_inplace_nn), R the revbin permutation, H_{r}
the reversed Haar transform, and H_{tr} the transposed reversed Haar transform. Then

    H_{r}       = R \, H_{ni} \, R            (24.5-1a)
    H_{tr}      = R \, H_{tni} \, R           (24.5-1b)
    H_{r}^{-1}  = R \, H_{ni}^{-1} \, R       (24.5-1c)
    H_{tr}^{-1} = R \, H_{tni}^{-1} \, R      (24.5-1d)


Code for the reversed Haar transform [FXT: haar/haarrevnn.h]:

template <typename Type>
void haar_rev_nn(Type *f, ulong ldn)
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        ulong r = 0;
//        for (ulong r=0; r<n; r+=m)  // almost walsh_wak_dif2()
        {
            ulong t1 = r;
            ulong t2 = r + mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}
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'This is almost the radix-2 DIF implementation for the Walsh transform. The only change is that the line 
for (ulong r=0; r<n; r+=m) was replaced by ulong r = 0. The transform can also be computed via 
the following sequence of statements: 

{ revbin_permute(); haar_inplace_nn(); revbin_permute(); }.
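
The equivalence of the two computations can be tested with integer data, where the comparison is exact. The following is a minimal sketch on std::vector<long>; the bit-reversal helper and the simple revbin_permute() are assumptions for illustration, not the FXT routines.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef unsigned long ulong;

// Swap f[k] with f[r] where r is k with its lowest ldn bits reversed.
static void revbin_permute(std::vector<long> &f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong k=0; k<n; ++k)
    {
        ulong r = 0, x = k;
        for (ulong b=0; b<ldn; ++b)  { r = (r<<1) | (x&1);  x >>= 1; }
        if ( r > k )  std::swap(f[k], f[r]);
    }
}

// Reversed Haar transform: DIF butterflies on the first block only.
void haar_rev_nn(std::vector<long> &f, ulong ldn)
{
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong t1=0, t2=mh; t1<mh; ++t1, ++t2)
        {
            const long u = f[t1],  v = f[t2];
            f[t1] = u + v;
            f[t2] = u - v;
        }
    }
}

// Non-normalized in-place Haar transform.
void haar_inplace_nn(std::vector<long> &f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong js=2; js<=n; js<<=1)
    {
        for (ulong j=0, t=js>>1; j<n; j+=js, t+=js)
        {
            const long x = f[j],  y = f[t];
            f[j] = x + y;
            f[t] = x - y;
        }
    }
}

// The same transform via relation 24.5-1a.
void haar_rev_nn_via_revbin(std::vector<long> &f, ulong ldn)
{
    revbin_permute(f, ldn);
    haar_inplace_nn(f, ldn);
    revbin_permute(f, ldn);
}
```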


The inverse transform is obtained by the equivalent modification with the DIT implementation for the
Walsh transform and normalization:


template <typename Type>
void inverse_haar_rev_nn(Type *f, ulong ldn)
{
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        ulong r = 0;
//        for (ulong r=0; r<n; r+=m)  // almost walsh_wak_dit2()
        {
            ulong t1 = r;
            ulong t2 = r + mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1] * 0.5;
                Type v = f[t2] * 0.5;
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}
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The reversed transposed Haar transform is, up to normalization, the inverse of haar_rev_nn(). It is 


given in [FXT: haar/transposedhaarrevnn.h]:


template <typename Type>
void transposed_haar_rev_nn(Type *f, ulong ldn)
{
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        ulong r = 0;
//        for (ulong r=0; r<n; r+=m)  // almost walsh_wak_dit2()
        {
            ulong t1 = r;
            ulong t2 = r + mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}


The same result could be computed with the following sequence of statements: 
{ revbin_permute(); transposed_haar_inplace_nn(); revbin_permute(); }.


The inverse transform is 


template <typename Type>
void inverse_transposed_haar_rev_nn(Type *f, ulong ldn)
{
    // const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        ulong r = 0;
//        for (ulong r=0; r<n; r+=m)  // almost walsh_wak_dif2()
        {
            ulong t1 = r;
            ulong t2 = r + mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1] * 0.5;
                Type v = f[t2] * 0.5;
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}


24.6 Relations between Walsh and Haar transforms 


24.6.1 Computing Walsh transforms via Haar transforms 


A length-n Walsh transform can be computed with one length-n Haar transform, one transform of
length n/2, two transforms of length n/4, four transforms of length n/8, ..., and n/4 transforms of length 2. We
implement the Walsh transform W_k (the one with the Walsh Kronecker base) using the reversed Haar
transform:
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Haar transforms: 

H(16) H(8) H(4) H(2) 
AAAAAAAAaaaaaaaa BBBBbbbb CCcc Dd 
AAAAaaaa BBbb Cc 
AAaa Bb 
Aa 

Walsh(16) =^= 1*H(16) + 1*H(8) + 2*H(4) + 4*H(2)
AAAAAAAAaaaaaaaa 
AAAAaaaaBBBBbbbb 
AAaaCCccBBbbCCcc 
AaDdCcDdBbDdCcDd 


Figure 24.6-A: Symbolic description of how to build a Walsh transform from Haar transforms. 


Transposed Haar transforms:

H(16)             H(8)      H(4)  H(2)
Aa
AAaa              Bb
AAAAaaaa          BBbb      Cc
AAAAAAAAaaaaaaaa  BBBBbbbb  CCcc  Dd

Walsh(16) =^= 1*H(16) + 1*H(8) + 2*H(4) + 4*H(2)
AaDdCcDdBbDdCcDd
AAaaCCccBBbbCCcc
AAAAaaaaBBBBbbbb
AAAAAAAAaaaaaaaa


Figure 24.6-B: Symbolic description of how to build a Walsh transform from Haar transforms,
transposed version.


// algorithm WH1:
ulong n = 1UL<<ldn;
haar_rev_nn(f, ldn);
for (ulong ldk=ldn-1; ldk>0; --ldk)
{
    ulong k = 1UL << ldk;
    for (ulong j=k; j<n; j+=2*k)  haar_rev_nn(f+j, ldk);
}
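
Algorithm WH1 performs the same butterflies as a radix-2 DIF Walsh transform, only grouped differently, so the results agree exactly for integer data. The sketch below checks this. The routine walsh_wak_dif2() is written here from the description in the text (the reversed Haar transform with the loop over r restored) and is an assumption, not the FXT source.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Reversed Haar transform (non-normalized) on f[0, ..., 2^ldn - 1].
void haar_rev_nn(long *f, ulong ldn)
{
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong t1=0, t2=mh; t1<mh; ++t1, ++t2)
        {
            const long u = f[t1],  v = f[t2];
            f[t1] = u + v;
            f[t2] = u - v;
        }
    }
}

// Radix-2 DIF Walsh transform: the same butterflies on every block of length m.
void walsh_wak_dif2(long *f, ulong ldn)
{
    const ulong n = 1UL << ldn;
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = 1UL << ldm;
        const ulong mh = m >> 1;
        for (ulong r=0; r<n; r+=m)
        {
            for (ulong j=0, t1=r, t2=r+mh; j<mh; ++j, ++t1, ++t2)
            {
                const long u = f[t1],  v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}

// Algorithm WH1: Walsh transform built from reversed Haar transforms.
void walsh_via_haar(long *f, ulong ldn)
{
    if ( ldn == 0 )  return;  // guard: ldn-1 would wrap around for unsigned types
    const ulong n = 1UL << ldn;
    haar_rev_nn(f, ldn);
    for (ulong ldk=ldn-1; ldk>0; --ldk)
    {
        const ulong k = 1UL << ldk;
        for (ulong j=k; j<n; j+=2*k)  haar_rev_nn(f+j, ldk);
    }
}
```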


The idea, as a symbolic scheme, is shown in figure 24.6-A. The scheme obtained by reversing the order of
the lines is shown in figure 24.6-B. It corresponds to the computation of W_k using the transposed version
of the Haar transform:

// algorithm WH1T:
ulong n = 1UL<<ldn;
for (ulong ldk=1; ldk<ldn; ++ldk)
{
    ulong k = 1UL << ldk;
    for (ulong j=k; j<n; j+=2*k)  transposed_haar_rev_nn(f+j, ldk);
}
transposed_haar_rev_nn(f, ldn);

Two more methods are found by reversing the individual lines of the schemes seen so far, see figure 24.6-C.
These correspond to the computation of the inverse Walsh transform (W_k^{-1} = (1/n) W_k) either as

// algorithm WH2T:
ulong n = 1UL<<ldn;
inverse_transposed_haar_rev_nn(f, ldn);
for (ulong ldk=ldn-1; ldk>0; --ldk)
{
    ulong k = 1UL << ldk;
    for (ulong j=k; j<n; j+=2*k)  inverse_transposed_haar_rev_nn(f+j, ldk);
}

or as

// algorithm WH2:
ulong n = 1UL<<ldn;
for (ulong ldk=1; ldk<ldn; ++ldk)
{
    ulong k = 1UL << ldk;
    for (ulong j=k; j<n; j+=2*k)  inverse_haar_rev_nn(f+j, ldk);
}
inverse_haar_rev_nn(f, ldn);

AAAAAAAAaaaaaaaa    aaaaaaaaAAAAAAAA
AAAAaaaaBBBBbbbb    bbbbBBBBaaaaAAAA
AAaaCCccBBbbCCcc    ccCCbbBBccCCaaAA
AaDdCcDdBbDdCcDd    dDcCdDbBdDcCdDaA
      WH1                 WH2T

AaDdCcDdBbDdCcDd    dDcCdDbBdDcCdDaA
AAaaCCccBBbbCCcc    ccCCbbBBccCCaaAA
AAAAaaaaBBBBbbbb    bbbbBBBBaaaaAAAA
AAAAAAAAaaaaaaaa    aaaaaaaaAAAAAAAA
      WH1T                WH2


Figure 24.6-C: Symbolic scheme of the four versions of the computation of the Walsh transform via
Haar transforms.


24.6.2 Computing Haar transforms via Walsh transforms 


The schemes given here are O(n log(n)) and not an efficient method to compute the Haar transform,
which is O(n). Instead, they can be used to identify the type of Haar transform that is the building block
of a given Walsh transform.


The non-normalized transposed reversed Haar transform can (up to normalization) be computed via 


// algorithm HW1:  transposed_haar_rev_nn(f, ldn); =^=
for (ulong ldk=1; ldk<ldn; ++ldk)
{
    ulong k = 1UL << ldk;
    walsh_wak(f+k, ldk);
}
walsh_wak(f, ldn);


and its inverse as 


// algorithm HW1I:  inverse_transposed_haar_rev_nn(f, ldn); =^=
walsh_wak(f, ldn);
for (ulong ldk=1; ldk<ldn; ++ldk)
{
    ulong k = 1UL << ldk;
    walsh_wak(f+k, ldk);
}


The non-normalized transposed Haar transform can (again, up to normalization) be computed via 


// algorithm HW2:  transposed_haar_nn(f, ldn); =^=
for (ulong ldk=1; ldk<ldn; ++ldk)
{
    ulong k = 1UL << ldk;
    walsh_pal(f+k, ldk);
}
walsh_pal(f, ldn);

Walsh transform:

W(16)
AaDdCcDdBbDdCcDd
AAaaCCccBBbbCCcc
AAAAaaaaBBBBbbbb
AAAAAAAAaaaaaaaa

Inverse (or transposed) Walsh transforms:

W(8):     W(4):  W(2):
BBBBbbbb  CCcc   Dd
BBbbCCcc  CcDd
BbDdCcDd

                            BBBBbbbb
                        CCccBBbbCCcc
                      DdCcDdBbDdCcDd
Aa                  AaDdCcDdBbDdCcDd
AAaa                AAaaCCccBBbbCCcc
AAAAaaaa            AAAAaaaaBBBBbbbb
AAAAAAAAaaaaaaaa    AAAAAAAAaaaaaaaa
Haar(16) =^= W(16) + W(8) + W(4) + W(2)


Figure 24.6-D: Symbolic description of how to build a Haar transform from Walsh transforms.


and its inverse as 


// algorithm HW2I:  inverse_transposed_haar_nn(f, ldn); =^=
walsh_pal(f, ldn);  // =^= revbin_permute(f, n); walsh_wak(f, ldn);
for (ulong ldk=1; ldk<ldn; ++ldk)
{
    ulong k = 1UL << ldk;
    walsh_pal(f+k, ldk);
}

The symbolic scheme is given in figure 24.6-D.


24.7 Prefix transform and prefix convolution 


 0: [++++++++++++++++]     0: [+-- -   -       ]
 1: [ + + + + + + + +]     1: [ + - -   -      ]
 2: [  +   +   +   + ]     2: [  +   -   -     ]
 3: [   +   +   +   +]     3: [   +   -   -    ]
 4: [    +       +   ]     4: [    +       -   ]
 5: [     +       +  ]     5: [     +       -  ]
 6: [      +       + ]     6: [      +       - ]
 7: [       +       +]     7: [       +       -]
 8: [        +       ]     8: [        +       ]
 9: [         +      ]     9: [         +      ]
10: [          +     ]    10: [          +     ]
11: [           +    ]    11: [           +    ]
12: [            +   ]    12: [            +   ]
13: [             +  ]    13: [             +  ]
14: [              + ]    14: [              + ]
15: [               +]    15: [               +]


Figure 24.7-A: Basis functions of the prefix transform (left) and its inverse (right).

[Table: 16 x 16 scheme in hexadecimal digits. The entry at (i, j) is the index k of the element h_k
that receives the product a_i b_j according to relation 24.7-2.]

Figure 24.7-B: Scheme for the prefix convolution (hexadecimal). 


The set of prefixes P(k) of a binary word k contains (k and) all binary words obtained by successively
removing the highest one. For example, P(22) = P(10110_2) contains the words 10110_2 = 22, 00110_2 = 6,
00010_2 = 2, and 00000_2 = 0. Define the prefix transform c of a by setting

    c_k = \sum_{i \in P(k)} a_i        (24.7-1)


An algorithm to compute the transform in linear time exploits the fact that i in P(k) implies 2i in P(2k)
[FXT: haar/prefix-transform.h]:


template <typename Type>
void prefix_transform(Type *f, ulong ldn)
{
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong i=0; i<mh; ++i)  f[i+mh] += f[i];
    }
}

The basis functions of the transform are shown in figure 24.7-A. The inverse transform is


template <typename Type>
void inverse_prefix_transform(Type *f, ulong ldn)
{
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong i=0; i<mh; ++i)  f[i+mh] -= f[i];
    }
}


Define the prefix convolution h of two sequences a and b by

    h_k = -a_k \, b_k + \sum_{j \in P(k)} \left( a_j \, b_k + a_k \, b_j \right)        (24.7-2)


Figure 24.7-B shows the semi-symbolic scheme. The computation by definition costs O(n log(n))
operations:


template <typename Type>
inline void slow_prefix_convolution(const Type *f, const Type *g, ulong ldn, Type *h)
{
    const ulong n = 1UL << ldn;
    for (ulong k=0; k<n; ++k)  h[k] = f[k] * g[k];
    for (ulong k=1; k<n; ++k)
    {
        ulong j = k;
        do
        {
            j ^= highest_one(j);
            h[k] += f[k] * g[j];
            h[k] += f[j] * g[k];
        }
        while ( j );
    }
}

The convolution can be computed in linear time via the prefix transform: 


template <typename Type>
inline void prefix_convolution(Type * restrict f, Type * restrict g, ulong ldn)
{
    prefix_transform(f, ldn);
    prefix_transform(g, ldn);
    const ulong n = (1UL<<ldn);
    for (ulong k=0; k<n; ++k)  g[k] *= f[k];
    inverse_prefix_transform(g, ldn);
}
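
The linear-time method works because the prefixes of k form a chain: the product of the two prefix transforms at index k contains every product a_i b_j with i and j in P(k) exactly once, and each such product is counted in h_m for m the larger of i and j. A minimal sketch that compares the two computations on integer data follows. It uses std::vector<long> and a simple highest_one() helper, both of which are assumptions for illustration rather than the FXT code.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Isolate the highest set bit of x (zero for x == 0).
static ulong highest_one(ulong x)
{
    if ( x == 0 )  return 0;
    ulong h = 1;
    while ( x >>= 1 )  h <<= 1;
    return h;
}

void prefix_transform(std::vector<long> &f, ulong ldn)
{
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong i=0; i<mh; ++i)  f[i+mh] += f[i];
    }
}

void inverse_prefix_transform(std::vector<long> &f, ulong ldn)
{
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong mh = 1UL << (ldm-1);
        for (ulong i=0; i<mh; ++i)  f[i+mh] -= f[i];
    }
}

// Prefix convolution by definition (relation 24.7-2).
std::vector<long> slow_prefix_convolution(const std::vector<long> &f,
                                          const std::vector<long> &g, ulong ldn)
{
    const ulong n = 1UL << ldn;
    std::vector<long> h(n);
    for (ulong k=0; k<n; ++k)  h[k] = f[k] * g[k];
    for (ulong k=1; k<n; ++k)
    {
        ulong j = k;
        do
        {
            j ^= highest_one(j);  // next proper prefix of k
            h[k] += f[k] * g[j];
            h[k] += f[j] * g[k];
        }
        while ( j );
    }
    return h;
}

// Prefix convolution via the prefix transform; arguments are copied.
std::vector<long> prefix_convolution(std::vector<long> f, std::vector<long> g, ulong ldn)
{
    prefix_transform(f, ldn);
    prefix_transform(g, ldn);
    const ulong n = 1UL << ldn;
    for (ulong k=0; k<n; ++k)  g[k] *= f[k];
    inverse_prefix_transform(g, ldn);
    return g;
}
```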


24.8 Nonstandard splitting schemes ‡

All radix-2 transforms recursively split the length of the array into halves. The size of the transforms
is limited to powers of 2. In a recursive implementation we use the trivial equality 2^k = 2^{k-1} + 2^{k-1}.
With N_k := 2^k we have N_0 = 1 and N_k = N_{k-1} + N_{k-1}. We use different recursive schemes to derive
nonstandard variants of the Haar and Walsh transforms.


24.8.1 Fibonacci-Haar and Fibonacci-Walsh transform


0: [****c- cz 6 o BG Go 4+ 9 4+ 4+ cA 4+ c4 4+ 4+ cR 4+ 4 
1: [+-++-+-++-++-+-++-+-+] 
2: [+ - + + - + -+ + -+ + - ] 
3: [+ - + + - + - + ] 
4: [ + - + + - + - + ] 
5: [+ = * * = ] 
6: [ + = + * - ] 
7: [ + E + + -] 
8: [+ = + ] 
9: [ + - * ] 
10: [ * - * ] 
11: [I * - * ] 
12: [ * - * ] 
13: [ + - ] 
14: [ + - ] 
15: [ * » ] 
16: [ + = ] 
17: [ + = ] 
18: [ + = ] 
19: [ + E J 
20: [ + -] 




Figure 24.8-A: Basis functions for the non-normalized Fibonacci-Haar transform. Only the signs of the 
nonzero entries are shown. At the blank entries the functions are zero. 


We use the Fibonacci numbers F_n = F_{n-1} + F_{n-2} (where F_0 = 0 and F_1 = 1) to construct a
Fibonacci-Haar transform as follows [FXT: haar/fib-haar.h]:


inline void fibonacci_haar(double *a, ulong f0, ulong f1)
// In-place Fibonacci-Haar transform of a[0,...,f0-1].
// f0 must be a Fibonacci number, f1 the next smaller Fibonacci number.
{
    if ( f0 < 2 )  return;
    ulong f2 = f0 - f1;
    for (ulong j=0,k=f1; j<f2; ++j,++k)
    {
        double u = a[j],  v = a[k];
        a[j] = (u+v) * SQRT1_2;
        a[k] = (u-v) * SQRT1_2;
    }
    fibonacci_haar(a, f1, f2);
}


Omitting the multiplications with 1/sqrt(2) (=SQRT1_2) gives a non-normalized version of the transform.
The basis functions for the non-normalized transform with length 21 (= F_8) are shown in figure 24.8-A
(compare to figure 24.5-A on page 506). The second row corresponds to the rabbit sequence described in
section 38.11 on page 753. Figure 24.8-A was created with the program [FXT: fft/fib-haar-demo.cc].
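
Each step of the normalized transform applies a 2 x 2 rotation to disjoint pairs of elements and leaves the rest untouched, so the sum of squares of the array is preserved. The following minimal sketch checks this for length 21. The constant SQRT1_2 is defined locally and the helper sum_of_squares() is hypothetical; neither is taken from FXT.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef unsigned long ulong;

static const double SQRT1_2 = 0.70710678118654752440;  // 1/sqrt(2)

// In-place Fibonacci-Haar transform of a[0,...,f0-1].
// f0 must be a Fibonacci number, f1 the next smaller Fibonacci number.
void fibonacci_haar(double *a, ulong f0, ulong f1)
{
    if ( f0 < 2 )  return;
    const ulong f2 = f0 - f1;
    for (ulong j=0, k=f1; j<f2; ++j, ++k)
    {
        const double u = a[j],  v = a[k];
        a[j] = (u+v) * SQRT1_2;
        a[k] = (u-v) * SQRT1_2;
    }
    fibonacci_haar(a, f1, f2);
}

// Sum of squares of all elements (hypothetical helper for the check).
double sum_of_squares(const std::vector<double> &a)
{
    double s = 0.0;
    for (ulong k=0; k<a.size(); ++k)  s += a[k] * a[k];
    return s;
}
```

For a length-21 array the call is fibonacci_haar(a.data(), 21, 13), because 13 is the Fibonacci number preceding 21.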


0: [****c- cz 6 o BG Go 4+ 4+ A o A AB B MR 4+ Rcx] 
1: [+-++-+-+ 4-4 4+ -4+ -4+ 4-4-4 4) 
2: [+ - + + - + -+ + -+ + -] 
3: [++ --++ ++ - -++ - -++ ] 
4: [+- = ++- +- - ++- - ++- ] 
5: [+++ ---+++ +++ ---] 
6: [+-+ -+-+-+4+ +-+ +-] 
7: [+ - ++ - + - + ] 
8: [+++4+  ----- ++ +++ ] 
9: [+-++- +--++-++- ] 
10: [ + - + - +- + - + ] 
11: [++ -- - - ++++ -- ] 
12: [ + - - + + +-+- - + ] 
13: [+++ ++ 4+ 4+ 4+ c------- ] 
14: [+-++-+4+-+ +- -+ +-] 
15: [ + - + + - - +- + ] 
16: [ + + - -++ - - ++-- ] 
17: [ + - = ++- - + +- -+ ] 
18: [+++ - -- - -- +++] 
19: [+ - + - +- +- +-+] 
20: [+ - - + - + + - ] 


Figure 24.8-B: Basis functions for the non-normalized Fibonacci-Walsh transform. 


The implementation of the Fibonacci-Walsh transform differs by just one line from the code for the
Fibonacci-Haar transform [FXT: walsh/fib-walsh.h]:


inline void fibonacci_walsh(double *a, ulong f0, ulong f1)
// In-place Fibonacci-Walsh transform of a[0,...,f0-1].
// f0 must be a Fibonacci number, f1 the next smaller Fibonacci number.
{
    if ( f0 < 2 )  return;
    ulong f2 = f0 - f1;
    for (ulong j=0,k=f1; j<f2; ++j,++k)
    {
        double u = a[j],  v = a[k];
        a[j] = (u+v) * SQRT1_2;
        a[k] = (u-v) * SQRT1_2;
    }
    fibonacci_walsh(a, f1, f2);
    fibonacci_walsh(a+f1, f2, f1-f2);  // <--= omit line to obtain Haar transform
}


The basis functions for the length-21 transform are shown in figure 24.8-B, which was created with the
program [FXT: fft/fib-walsh-demo.cc].

One can find Haar-like and Walsh-like transforms for any linear recursive sequence that is increasing. A 
construction for recurrences Ng = Ni. .1 + Ny 1. is considered in [136]. 


24.8.2 Mersenne-Haar and Mersenne-Walsh transform 


For the Mersenne numbers $M_k = 2^k - 1$ we have the recursion $M_k = 2\,M_{k-1} + 1$. This gives the splitting
scheme in the Mersenne-Walsh transform [FXT: walsh/mers-walsh.h]:

inline void mersenne_walsh(double *a, ulong f0)
// In-place Mersenne-Walsh transform of a[0,...,f0-1].
// f0 must be a Mersenne number.
// Self-inverse.
{
Mersenne-Haar Mersenne-Walsh
0: [ + + + + + + + +] 0: [+ + + + + + + +] 
1: [ + + + + ] 1: [ + + + + ] 
2: [+ - + - + - + -] 2: [+ - + - + - + - ] 
3: [ + + ] 3: [ + + ] 
4: [+ - + - ] 4: [ + + - - + + - -] 
5: [ + - * - ] 5: [ + = + - ] 
6: [ + - + - ] 6: [+ - - + + - - +] 
ME + ] Tl + ] 
8: [+ - ] 8: [+ + + + - - - - ] 
9: [ + = ] 9: — + + = - ] 
10: [ + - ] 10: [* = + - - + - +) 
11: [ + z ] 11: [ * i ] 
12: [ + = ] 12: [s eee 
13: [ * - ] 13: [ + = - + ] 
14: [ + =] 14: [+ = = + $ + =] 


Figure 24.8-C: Basis functions for the non-normalized Mersenne-Haar transform (left) and Mersenne- 
Walsh transform (right). Only the signs of the nonzero entries are shown. At the blank entries the 
functions are zero. 


    if ( f0 < 2 )  return;
    ulong f1 = f0 >> 1;  // next smaller Mersenne number
    for (ulong j=0,k=f1+1; j<f1; ++j,++k)
    {
        double u = a[j],  v = a[k];
        a[j] = (u+v) * SQRT1_2;
        a[k] = (u-v) * SQRT1_2;
    }
    mersenne_walsh(a, f1);
    mersenne_walsh(a+f1+1, f1);  // <--= omit line to obtain Mersenne-Haar transform
}


Figure 24.8-C (right) gives the basis functions for the non-normalized Mersenne-Walsh transform. The
Mersenne-Haar transform is obtained by deleting one line as indicated. The implementation is given
in [FXT: haar/mers-haar.h], the basis functions of the non-normalized version are shown at the left
of figure 24.8-C. The figure was created with the programs [FXT: fft/mers-walsh-demo.cc] and
[FXT: fft/mers-haar-demo.cc]. Note that both transforms leave the central element unchanged.
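The two properties stated for the Mersenne-Walsh transform (self-inverse, central element unchanged) can be checked numerically. The following is a sketch under the assumption of normalized steps as in the listing; mers_walsh_sketch(), MersCheck and mers_walsh_check() are illustrative names only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef unsigned long ulong;

// Same recursion as mersenne_walsh() in the text, with 1/sqrt(2) written out.
void mers_walsh_sketch(double *a, ulong f0)
{
    if ( f0 < 2 )  return;
    const double s = std::sqrt(0.5);
    const ulong f1 = f0 >> 1;  // next smaller Mersenne number
    for (ulong j=0, k=f1+1; j<f1; ++j, ++k)
    {
        const double u = a[j],  v = a[k];
        a[j] = (u+v) * s;
        a[k] = (u-v) * s;
    }
    mers_walsh_sketch(a, f1);
    mers_walsh_sketch(a+f1+1, f1);
}

struct MersCheck
{
    double center_shift;     // |change of the central element| after one transform
    double roundtrip_error;  // max |difference to the input| after two transforms
};

MersCheck mers_walsh_check()
{
    const ulong n = 15;  // 15 = 2^4 - 1
    std::vector<double> a(n);
    for (ulong i=0; i<n; ++i)  a[i] = std::sin(1.0 + i);
    const std::vector<double> orig = a;

    MersCheck r;
    mers_walsh_sketch(a.data(), n);
    r.center_shift = std::fabs(a[n/2] - orig[n/2]);

    mers_walsh_sketch(a.data(), n);
    r.roundtrip_error = 0.0;
    for (ulong i=0; i<n; ++i)
        r.roundtrip_error = std::fmax(r.roundtrip_error, std::fabs(a[i]-orig[i]));
    return r;
}
```

The central element a[7] is never touched by any of the recursion steps, so its shift is exactly zero, while the round trip agrees with the input up to rounding.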
Chapter 25


The Hartley transform 


The Hartley transform is a trigonometric transform that maps real data to real data. While the fast 
algorithms for radix-2 can be found without great difficulty, the higher radix algorithms are not obvious. 
Therefore it is appropriate to describe the Hartley transform in terms of the Fourier transform. A method 
for the conversion of FFT algorithms to fast Hartley transform (FHT) algorithms is given. 


We give algorithms for the conversion of Hartley transforms to and from Fourier transforms. We further
develop FHT-based convolution routines for complex and real-valued data, and for the negacyclic
convolution.


25.1 Definition and symmetries 


The discrete Hartley transform of a length-n sequence a is defined as

    $c = \mathcal{H}[a]$    (25.1-1a)

    $c_k := \frac{1}{\sqrt{n}} \sum_{x=0}^{n-1} a_x \left( \cos\frac{2\pi k x}{n} + \sin\frac{2\pi k x}{n} \right)$    (25.1-1b)

This is almost the discrete Fourier transform, but with ‘cos + sin’ instead of ‘cos + i sin’. The continuous
Hartley transform is treated in [177].

The Hartley transform of a purely real sequence is purely real:

    $\mathcal{H}[a] \in \mathbb{R}$ for $a \in \mathbb{R}$    (25.1-2)

The transform is its own inverse:

    $\mathcal{H}[\mathcal{H}[a]] = a$    (25.1-3)


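Symmetry is conserved, as with the Fourier transform: the Hartley transform of a symmetric (antisymmetric)
sequence is symmetric (antisymmetric). Using the notation introduced on page 428, where an overline
denotes the reversed sequence, we have

    $\mathcal{H}[a_S] = +\overline{\mathcal{H}[a_S]} = +\mathcal{H}[\overline{a_S}]$    (25.1-4a)
    $\mathcal{H}[a_A] = -\overline{\mathcal{H}[a_A]} = -\mathcal{H}[\overline{a_A}]$    (25.1-4b)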


An algorithm for the fast (n log(n)) computation of the Hartley transform is called a fast Hartley
transform (FHT).
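The definition can be evaluated directly with n^2 operations, which is useful as a reference when testing fast algorithms. The sketch below does this with the normalization $1/\sqrt{n}$ of relation 25.1-1b and measures how well applying the transform twice reproduces the input; slow_dht() and dht_roundtrip_error() are names chosen for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Direct O(n^2) evaluation of the discrete Hartley transform,
// normalized by 1/sqrt(n) so that the transform is its own inverse.
std::vector<double> slow_dht(const std::vector<double> &a)
{
    const std::size_t n = a.size();
    assert( n > 0 );
    const double pi = std::acos(-1.0);
    std::vector<double> c(n, 0.0);
    for (std::size_t k=0; k<n; ++k)
    {
        for (std::size_t x=0; x<n; ++x)
        {
            // reduce k*x modulo n to keep the argument small
            const double phi = 2.0*pi*double((k*x) % n)/double(n);
            c[k] += a[x] * (std::cos(phi) + std::sin(phi));
        }
        c[k] /= std::sqrt(double(n));
    }
    return c;
}

// max |H[H[a]] - a| over all elements.
double dht_roundtrip_error(const std::vector<double> &a)
{
    const std::vector<double> b = slow_dht(slow_dht(a));
    double err = 0.0;
    for (std::size_t k=0; k<a.size(); ++k)
        err = std::fmax(err, std::fabs(b[k]-a[k]));
    return err;
}
```

The length need not be a power of 2 here; that restriction only enters with the radix-2 algorithms.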


25.2 Radix-2 FHT algorithms 


25.2.1 Decimation in time (DIT) FHT 


For a length-n sequence a let $\mathcal{X}^{1/2} a$ denote the sequence with elements
$a_x \cos(\pi x/n) + \overline{a}_x \sin(\pi x/n)$, where $\overline{a}$ is the reversed sequence
($\overline{a}_x = a_{-x}$, indices modulo n). The operator $\mathcal{X}^{1/2}$ is the equivalent to the
operator $\mathcal{S}^{1/2}$ of the Fourier transform algorithms. We use the notation (even) and (odd)
as with the radix-2 FFT algorithms. The radix-2 decimation in time (DIT) step for the FHT is

    $\mathcal{H}[a]^{(left)} \stackrel{n/2}{=} \mathcal{H}[a^{(even)}] + \mathcal{X}^{1/2} \mathcal{H}[a^{(odd)}]$    (25.2-1a)

    $\mathcal{H}[a]^{(right)} \stackrel{n/2}{=} \mathcal{H}[a^{(even)}] - \mathcal{X}^{1/2} \mathcal{H}[a^{(odd)}]$    (25.2-1b)

This is the equivalent to relations 21.2-3a and 21.2-3b on page 412.
Pseudocode for a recursive radix-2 DIT FHT (C++ version in [FXT: fht/recfht2.cc]):


procedure rec_fht_dit2(a[], n, x[])
// real a[0..n-1] input
// real x[0..n-1] result
{
    real b[0..n/2-1], c[0..n/2-1]  // workspace
    real s[0..n/2-1], t[0..n/2-1]  // workspace

    if n == 1 then
    {
        x[0] := a[0]
        return
    }

    nh := n/2;

    for k:=0 to nh-1
    {
        s[k] := a[2*k]    // even indexed elements
        t[k] := a[2*k+1]  // odd indexed elements
    }

    rec_fht_dit2(s[], nh, b[])
    rec_fht_dit2(t[], nh, c[])

    hartley_shift(c[], nh, 1/2)

    for k:=0 to nh-1
    {
        x[k]    := b[k] + c[k];
        x[k+nh] := b[k] - c[k];
    }
}


The result is returned in the array x[]. The procedure hartley_shift() implements the operator
$\mathcal{X}^{1/2}$: it replaces element $c_k$ of the input sequence c by
$c_k \cos(\pi k/n) + c_{n-k} \sin(\pi k/n)$. As pseudocode:

procedure hartley_shift_05(a[], n)
// real a[0..n-1] input, result
{
    nh := n/2
    j := n-1
    for k:=1 to nh-1
    {
        c := cos( PI*k/n )
        s := sin( PI*k/n )
        { a[k], a[j] } := { c*a[k] + s*a[j], s*a[k] - c*a[j] }  // parallel assignment
        j := j-1
    }
}
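The two procedures above translate almost line by line into C++. The following sketch uses std::vector instead of raw arrays and compares the result with a direct evaluation of the (unnormalized) definition; it is meant as an illustration only, the FXT versions are the ones referenced in the text. The helpers slow_fht() and max_abs_diff() are assumptions made for the comparison.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> vec;

// a[k] <- a[k]*cos(pi*k/n) + a[n-k]*sin(pi*k/n), done in place for the pairs (k, n-k).
void hartley_shift_05(vec &a)
{
    const std::size_t n = a.size();
    if ( n < 2 )  return;
    const double pi = std::acos(-1.0);
    for (std::size_t k=1, j=n-1; k<j; ++k, --j)
    {
        const double c = std::cos(pi*double(k)/double(n));
        const double s = std::sin(pi*double(k)/double(n));
        const double ak = a[k],  aj = a[j];
        a[k] = c*ak + s*aj;
        a[j] = s*ak - c*aj;
    }
}

// Unnormalized recursive radix-2 DIT FHT; the length must be a power of 2.
vec rec_fht_dit2(const vec &a)
{
    const std::size_t n = a.size();
    assert( n != 0 && (n & (n-1)) == 0 );
    if ( n == 1 )  return a;

    const std::size_t nh = n/2;
    vec s(nh), t(nh);
    for (std::size_t k=0; k<nh; ++k)
    {
        s[k] = a[2*k];    // even indexed elements
        t[k] = a[2*k+1];  // odd indexed elements
    }

    const vec b = rec_fht_dit2(s);
    vec c = rec_fht_dit2(t);
    hartley_shift_05(c);

    vec x(n);
    for (std::size_t k=0; k<nh; ++k)
    {
        x[k]    = b[k] + c[k];
        x[k+nh] = b[k] - c[k];
    }
    return x;
}

// Direct O(n^2) unnormalized Hartley transform, used as reference.
vec slow_fht(const vec &a)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    vec c(n, 0.0);
    for (std::size_t k=0; k<n; ++k)
        for (std::size_t x=0; x<n; ++x)
        {
            const double phi = 2.0*pi*double((k*x) % n)/double(n);
            c[k] += a[x] * (std::cos(phi) + std::sin(phi));
        }
    return c;
}

double max_abs_diff(const vec &u, const vec &v)
{
    double d = 0.0;
    for (std::size_t k=0; k<u.size(); ++k)  d = std::fmax(d, std::fabs(u[k]-v[k]));
    return d;
}
```

The recursion allocates workspace at every level, which is acceptable for a reference implementation but not for production code.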


C++ implementations are given in [FXT: fht/hartleyshift.h]. A version that exploits the symmetry of
the trigonometric factors is

#define Tdouble long double
#define Sin sinl

template <typename Type>
inline void hartley_shift_05_v2rec(Type *f, ulong n)
{
    const ulong nh = n/2;
    if ( n>=4 )
    {
        ulong im=nh/2, jm=3*im;
        Type fi = f[im], fj = f[jm];
        double cs = SQRT1_2;
        f[im] = (fi + fj) * cs;
        f[jm] = (fi - fj) * cs;

        if ( n>=8 )
        {
            const Tdouble phi0 = PI/n;
            Tdouble be = Sin(phi0), al = Sin(0.5*phi0); al *= (2.0*al);
            Tdouble s = 0.0, c = 1.0;
            for (ulong i=1, j=n-1, k=nh-1, l=nh+1; i<k; ++i, --j, --k, ++l)
            {
                // trigonometric recursion for c = cos(i*phi0), s = sin(i*phi0):
                { Tdouble tt = c;  c -= (al*tt+be*s);  s -= (al*s-be*tt); }

                Type fi = f[i], fj = f[j];
                f[i] = fi * (double)c + fj * (double)s;
                f[j] = fi * (double)s - fj * (double)c;

                // l == i+nh;  k == j-nh;
                fi = f[k];  fj = f[l];
                f[k] = fi * (double)s + fj * (double)c;
                f[l] = fi * (double)c - fj * (double)s;
            }
        }
    }
}

#undef Tdouble
#undef Sin

Pseudocode for a non-recursive radix-2 DIT FHT:

procedure fht_depth_first_dit2(a[], ldn)
// real a[0..n-1] input,result
{
    n := 2**ldn  // length of a[] is a power of 2

    revbin_permute(a[], n)

    for ldm:=1 to ldn
    {
        m  := 2**ldm
        mh := m/2
        m4 := m/4

        for r:=0 to n-m step m
        {
            for j:=1 to m4-1  // hartley_shift(a+r+mh,mh,1/2)
            {
                k := mh - j
                u := a[r+mh+j]
                v := a[r+mh+k]
                c := cos(j*PI/mh)
                s := sin(j*PI/mh)
                { u, v } := { c*u + s*v, s*u - c*v }  // parallel assignment
                a[r+mh+j] := u
                a[r+mh+k] := v
            }

            for j:=0 to mh-1
            {
                u := a[r+j]
                v := a[r+j+mh]
                a[r+j]    := u + v
                a[r+j+mh] := u - v
            }
        }
    }
}


The derivation of the ‘usual’ DIT2 FHT algorithm starts by combining the Hartley-shift with the
sum/diff-operations [FXT: fht/fhtdit2.cc]:

void fht_depth_first_dit2(double *f, ulong ldn)
{
    const ulong n = 1UL<<ldn;

    revbin_permute(f, n);

    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        const ulong m4 = (mh>>1);
        const double phi0 = M_PI/mh;

        for (ulong r=0; r<n; r+=m)
        {
            {  // j == 0:
                ulong t1 = r;
                ulong t2 = t1 + mh;
                sumdiff(f[t1], f[t2]);
            }

            if ( m4 )
            {
                ulong t1 = r + m4;
                ulong t2 = t1 + mh;
                sumdiff(f[t1], f[t2]);
            }

            for (ulong j=1, k=mh-1; j<k; ++j,--k)
            {
                double s, c;
                SinCos(phi0*j, &s, &c);

                ulong tj = r + mh + j;
                ulong tk = r + mh + k;
                double fj = f[tj];
                double fk = f[tk];
                f[tj] = fj * c + fk * s;
                f[tk] = fj * s - fk * c;

                ulong t1 = r + j;
                ulong t2 = tj;  // == t1 + mh;
                sumdiff(f[t1], f[t2]);

                t1 = r + k;
                t2 = tk;  // == t1 + mh;
                sumdiff(f[t1], f[t2]);
            }
        }
    }
}


Finally, as with the FFT equivalent (see section 21.2.1.3 on page 414), the number of trigonometric
computations can be reduced by swapping the innermost loops [FXT: fht/fhtdit2.cc]:

void fht_dit2(double *f, ulong ldn)
// Radix-2 decimation in time (DIT) FHT.
{
    const ulong n = 1UL<<ldn;

    revbin_permute(f, n);

    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        const ulong m4 = (mh>>1);
        const double phi0 = M_PI/mh;

        for (ulong r=0; r<n; r+=m)
        {
            {  // j == 0:
                ulong t1 = r;
                ulong t2 = t1 + mh;
                sumdiff(f[t1], f[t2]);
            }

            if ( m4 )
            {
                ulong t1 = r + m4;
                ulong t2 = t1 + mh;
                sumdiff(f[t1], f[t2]);
            }
        }

        for (ulong j=1, k=mh-1; j<k; ++j,--k)
        {
            double s, c;
            SinCos(phi0*j, &s, &c);

            for (ulong r=0; r<n; r+=m)
            {
                ulong tj = r + mh + j;
                ulong tk = r + mh + k;
                double fj = f[tj];
                double fk = f[tk];
                f[tj] = fj * c + fk * s;
                f[tk] = fj * s - fk * c;

                ulong t1 = r + j;
                ulong t2 = tj;  // == t1 + mh;
                sumdiff(f[t1], f[t2]);

                t1 = r + k;
                t2 = tk;  // == t1 + mh;
                sumdiff(f[t1], f[t2]);
            }
        }
    }
}


25.2.2 Decimation in frequency (DIF) FHT 
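The radix-2 decimation in frequency step for the FHT is (compare to relations 21.2-6a and 21.2-6b on
page 415):

    $\mathcal{H}[a]^{(even)} \stackrel{n/2}{=} \mathcal{H}\left[ a^{(left)} + a^{(right)} \right]$    (25.2-2a)

    $\mathcal{H}[a]^{(odd)} \stackrel{n/2}{=} \mathcal{H}\left[ \mathcal{X}^{1/2} \left( a^{(left)} - a^{(right)} \right) \right]$    (25.2-2b)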


Pseudocode for a recursive radix-2 DIF FHT (the C++ equivalent is given in [FXT: fht/recfht2.cc]):

procedure rec_fht_dif2(a[], n, x[])
// real a[0..n-1] input
// real x[0..n-1] result
{
    real b[0..n/2-1], c[0..n/2-1]  // workspace
    real s[0..n/2-1], t[0..n/2-1]  // workspace

    if n == 1 then
    {
        x[0] := a[0]
        return
    }

    nh := n/2;

    for k:=0 to nh-1
    {
        s[k] := a[k]     // 'left' elements
        t[k] := a[k+nh]  // 'right' elements
    }

    for k:=0 to nh-1
    {
        { s[k], t[k] } := { s[k]+t[k], s[k]-t[k] }  // parallel assignment
    }

    hartley_shift(t[], nh, 1/2)

    rec_fht_dif2(s[], nh, b[])
    rec_fht_dif2(t[], nh, c[])

    j := 0
    for k:=0 to nh-1
    {
        x[j]   := b[k]
        x[j+1] := c[k]
        j := j+2
    }
}

Pseudocode for a non-recursive radix-2 DIF FHT (C++ version in [FXT: fht/fhtdif2.cc]):

procedure fht_depth_first_dif2(a[], ldn)
// real a[0..n-1] input,result
{
    n := 2**ldn  // length of a[] is a power of 2

    for ldm:=ldn to 1 step -1
    {
        m  := 2**ldm
        mh := m/2
        m4 := m/4

        for r:=0 to n-m step m
        {
            for j:=0 to mh-1
            {
                u := a[r+j]
                v := a[r+j+mh]
                a[r+j]    := u + v
                a[r+j+mh] := u - v
            }

            for j:=1 to m4-1
            {
                k := mh - j
                u := a[r+mh+j]
                v := a[r+mh+k]
                c := cos(j*PI/mh)
                s := sin(j*PI/mh)
                { u, v } := { c*u + s*v, s*u - c*v }  // parallel assignment
                a[r+mh+j] := u
                a[r+mh+k] := v
            }
        }
    }

    revbin_permute(a[], n)
}


The ‘usual’ DIF2 FHT algorithm is again obtained by swapping the inner loops, a C++ implementation
is [FXT: fht/fhtdif2.cc]:

void fht_dif2(double *f, ulong ldn)
// Radix-2 decimation in frequency (DIF) FHT
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        const ulong m4 = (mh>>1);
        const double phi0 = M_PI/mh;

        for (ulong r=0; r<n; r+=m)
        {
            {  // j == 0:
                ulong t1 = r;
                ulong t2 = t1 + mh;
                sumdiff(f[t1], f[t2]);
            }

            if ( m4 )
            {
                ulong t1 = r + m4;
                ulong t2 = t1 + mh;
                sumdiff(f[t1], f[t2]);
            }
        }

        for (ulong j=1, k=mh-1; j<k; ++j,--k)
        {
            double s, c;
            SinCos(phi0*j, &s, &c);

            for (ulong r=0; r<n; r+=m)
            {
                ulong tj = r + mh + j;
                ulong tk = r + mh + k;

                ulong t1 = r + j;
                ulong t2 = tj;  // == t1 + mh;
                sumdiff(f[t1], f[t2]);

                t1 = r + k;
                t2 = tk;  // == t1 + mh;
                sumdiff(f[t1], f[t2]);

                double fj = f[tj];
                double fk = f[tk];
                f[tj] = fj * c + fk * s;
                f[tk] = fj * s - fk * c;
            }
        }
    }

    revbin_permute(f, n);
}


25.3 Complex FFT by FHT 
The relations between the Hartley and Fourier transforms can be read off directly from their definitions
and their symmetry relations. Let $\sigma$ be the sign of the Fourier transform. The Fourier transform of a
complex sequence $d \in \mathbb{C}$ is, in terms of the Hartley transform,

    $\mathcal{F}[d] = \frac{1}{2} \left( \mathcal{H}[d] + \overline{\mathcal{H}[d]} + i\,\sigma \left( \mathcal{H}[d] - \overline{\mathcal{H}[d]} \right) \right)$    (25.3-1)

Written out for the real and imaginary part of $d = a + i\,b$ ($a, b \in \mathbb{R}$):

    $\mathrm{Re}\, \mathcal{F}[a + i\,b] = \frac{1}{2} \left( \mathcal{H}[a] + \overline{\mathcal{H}[a]} - \sigma \left( \mathcal{H}[b] - \overline{\mathcal{H}[b]} \right) \right)$    (25.3-2a)

    $\mathrm{Im}\, \mathcal{F}[a + i\,b] = \frac{1}{2} \left( \mathcal{H}[b] + \overline{\mathcal{H}[b]} + \sigma \left( \mathcal{H}[a] - \overline{\mathcal{H}[a]} \right) \right)$    (25.3-2b)

Using the symmetry relations 25.1-4a and 25.1-4b we recast the equations as

    $\mathrm{Re}\, \mathcal{F}[a + i\,b] = \frac{1}{2} \mathcal{H}[a_S - \sigma\, b_A]$    (25.3-3a)

    $\mathrm{Im}\, \mathcal{F}[a + i\,b] = \frac{1}{2} \mathcal{H}[b_S + \sigma\, a_A]$    (25.3-3b)

Both formulations lead to the very same conversion procedure:

procedure fht_fft_conversion(a[], b[], n, is)
{
    for k:=1 to n/2-1
    {
        t := n-k

        as := a[k] + a[t]
        aa := a[k] - a[t]
        bs := b[k] + b[t]
        ba := b[k] - b[t]

        aa := is * aa
        ba := is * ba

        a[k] := 1/2 * (as - ba)
        a[t] := 1/2 * (as + ba)

        b[k] := 1/2 * (bs + aa)
        b[t] := 1/2 * (bs - aa)
    }
}


The C++ implementations are given in [FXT: fft/fhtfft.cc] for type double and [FXT: fft/fhtcfft.cc] for
type complex. There are two options to compute a complex FFT by two FHTs, we can compute the
FHTs either at the beginning of the routine

procedure fft_by_fht1(a[], b[], n, is)
// real a[0..n-1] input,result (real part)
// real b[0..n-1] input,result (imaginary part)
{
    fht(a[], n)
    fht(b[], n)
    fht_fft_conversion(a[], b[], n, is)
}

or at the end

procedure fft_by_fht2(a[], b[], n, is)
// real a[0..n-1] input,result (real part)
// real b[0..n-1] input,result (imaginary part)
{
    fht_fft_conversion(a[], b[], n, is)
    fht(a[], n)
    fht(b[], n)
}

The real and imaginary parts of the FFT are computed independently by this procedure. This can be
very advantageous when the real and imaginary parts of complex data lie in separate arrays. The C++
version is given in [FXT: fft/fhtfft.cc].
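The first variant (FHTs first, conversion afterwards) can be checked against a directly evaluated Fourier sum. The sketch below is an illustration under simplifying assumptions: the FHT is replaced by a slow O(n^2) routine slow_fht(), all transforms are unnormalized, and fft_by_fht_error() is a made-up helper that returns the largest deviation.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::vector<double> vec;

// Direct O(n^2) unnormalized Hartley transform (stand-in for a fast fht()).
vec slow_fht(const vec &a)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    vec c(n, 0.0);
    for (std::size_t k=0; k<n; ++k)
        for (std::size_t x=0; x<n; ++x)
        {
            const double phi = 2.0*pi*double((k*x) % n)/double(n);
            c[k] += a[x] * (std::cos(phi) + std::sin(phi));
        }
    return c;
}

// In-place conversion following the pseudocode fht_fft_conversion().
void fht_fft_conversion(vec &a, vec &b, int is)
{
    const std::size_t n = a.size();
    if ( n < 3 )  return;  // no pairs (k, n-k) with k != n-k
    for (std::size_t k=1, t=n-1; k<t; ++k, --t)
    {
        const double as = a[k] + a[t],  aa = is * (a[k] - a[t]);
        const double bs = b[k] + b[t],  ba = is * (b[k] - b[t]);
        a[k] = 0.5 * (as - ba);
        a[t] = 0.5 * (as + ba);
        b[k] = 0.5 * (bs + aa);
        b[t] = 0.5 * (bs - aa);
    }
}

// Max |F_k - (a[k] + i*b[k])| where F is the directly computed DFT with sign is.
double fft_by_fht_error(const vec &re, const vec &im, int is)
{
    assert( re.size() == im.size() );
    vec a = slow_fht(re),  b = slow_fht(im);
    fht_fft_conversion(a, b, is);

    const std::size_t n = re.size();
    const double pi = std::acos(-1.0);
    double err = 0.0;
    for (std::size_t k=0; k<n; ++k)
    {
        std::complex<double> f(0.0, 0.0);
        for (std::size_t x=0; x<n; ++x)
        {
            const double phi = is * 2.0*pi*double((k*x) % n)/double(n);
            f += std::complex<double>(re[x], im[x]) * std::polar(1.0, phi);
        }
        err = std::fmax(err, std::abs(f - std::complex<double>(a[k], b[k])));
    }
    return err;
}
```

Both signs of the transform are handled by the same conversion, only the factor `is` changes.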

25.4 Complex FFT by complex FHT and vice versa


A complex FHT is simply two FHTs (one of the real, one of the imaginary part). So we can use either
version from section 25.3 and there is nothing new. Really? If one has a type complex version of both
the conversion and the FHT routine, then the complex FFT can be computed as either

procedure fft_by_fht1(c[], n, is)
// complex c[0..n-1] input,result
{
    fht(c[], n)
    fht_fft_conversion(c[], n, is)
}

or the same with swapped statements. One saves half of the trigonometric computations and
bookkeeping. It is easy to derive a complex FHT from the real version, and with a well optimized FHT
you get an even better optimized FFT. C++ implementations of complex FHTs are given in
[FXT: fht/cfhtdif.cc] (DIF algorithm), [FXT: fht/cfhtdit.cc] (DIT algorithm), and, for zero padded data,
[FXT: fht/cfht0.cc].


The other way round: computation of a complex FHT using FFTs. Let $T$ be the operator corresponding
to the fht_fft_conversion. The operator is its own inverse: $T = T^{-1}$. We have seen that

    $\mathcal{F} = \mathcal{H} \cdot T$  and  $\mathcal{F} = T \cdot \mathcal{H}$    (25.4-1)

Multiply the relations with $T$ and use $T \cdot T = 1$ to obtain:

    $\mathcal{H} = T \cdot \mathcal{F}$  and  $\mathcal{H} = \mathcal{F} \cdot T$    (25.4-2)

Hence we have either

procedure fht_by_fft(c[], n, is)
// complex c[0..n-1] input,result
{
    fft(c[], n)
    fht_fft_conversion(c[], n, is)
}

or the same thing with swapped lines [FXT: fft/fhtcfft.cc]. The same ideas also work for separate real
and imaginary parts but in that case one should rather use separate FHTs for the two arrays.


25.5 Real FFT by FHT and vice versa 


To express the real and imaginary part of a Fourier transform of a purely real sequence $a \in \mathbb{R}$ by its
Hartley transform use relations 25.3-2a and 25.3-2b on page 521 and set $b = 0$:

    $\mathrm{Re}\, \mathcal{F}[a] = \frac{1}{2} \left( \mathcal{H}[a] + \overline{\mathcal{H}[a]} \right)$    (25.5-1a)

    $\mathrm{Im}\, \mathcal{F}[a] = \sigma\, \frac{1}{2} \left( \mathcal{H}[a] - \overline{\mathcal{H}[a]} \right)$    (25.5-1b)

A C++ implementation is [FXT: realfft/realfftbyfht.cc]:

void
fht_real_complex_fft(double *f, ulong ldn, int is/*=+1*/)
{
    fht(f, ldn);
    const ulong n = (1UL<<ldn);
    if ( is>0 )  for (ulong i=1,j=n-1; i<j; i++,j--)  sumdiff05(f[i], f[j]);
    else         for (ulong i=1,j=n-1; i<j; i++,j--)  sumdiff05_r(f[i], f[j]);
}

The functions sumdiff05() and sumdiff05_r() are defined as [FXT: aux0/sumdiff.h]:

template <typename Type>
static inline void sumdiff05(Type &a, Type &b)
// {a, b} <--| {0.5*(a+b), 0.5*(a-b)}
{ Type t=(a-b)*0.5; a+=b; a*=0.5; b=t; }

template <typename Type>
static inline void sumdiff05_r(Type &a, Type &b)
// {a, b} <--| {0.5*(a+b), 0.5*(b-a)}
{ Type t=(b-a)*0.5; a+=b; a*=0.5; b=t; }


At the end of the procedure the ordering of the output data $c = \mathcal{F}[a] \in \mathbb{C}$ is

    a[0]      =  Re c_0            (25.5-2)
    a[1]      =  Re c_1
    a[2]      =  Re c_2
    ...
    a[n/2]    =  Re c_{n/2}
    a[n/2+1]  =  Im c_{n/2-1}
    a[n/2+2]  =  Im c_{n/2-2}
    a[n/2+3]  =  Im c_{n/2-3}
    ...
    a[n-1]    =  Im c_1


The inverse procedure is given in [FXT: realfft/realfftbyfht.cc]:

void
fht_complex_real_fft(double *f, ulong ldn, int is/*=+1*/)
{
    const ulong n = (1UL<<ldn);
    if ( is>0 )  for (ulong i=1,j=n-1; i<j; i++,j--)  sumdiff(f[i], f[j]);
    else         for (ulong i=1,j=n-1; i<j; i++,j--)  diffsum(f[i], f[j]);
    fht(f, ldn);
}

The functions sumdiff() and diffsum() are defined in [FXT: aux0/sumdiff.h]:

template <typename Type>
static inline void sumdiff(Type &a, Type &b)
// {a, b} <--| {a+b, a-b}
{ Type t=a-b; a+=b; b=t; }

template <typename Type>
static inline void diffsum(Type &a, Type &b)
// {a, b} <--| {a-b, a+b}
{ Type t=a-b; b+=a; a=t; }


The input has to be ordered as given in relations 25.5-2. The sign of the transform
(variable is) has to be the same as with the forward version.

Computation of an FHT using a real-valued FFT proceeds similarly as for complex versions. Let $T_{r2c}$ be
the operator corresponding to the post-processing in fht_real_complex_fft(), and $T_{c2r}$ correspond to
the preprocessing in fht_complex_real_fft(). That is

    $\mathcal{F}_{c2r} = \mathcal{H} \cdot T_{c2r}$  and  $\mathcal{F}_{r2c} = T_{r2c} \cdot \mathcal{H}$    (25.5-3)

The operators are mutually inverse: $T_{r2c} = T_{c2r}^{-1}$ and $T_{c2r} = T_{r2c}^{-1}$. Multiplying the relations and using
$T_{r2c} \cdot T_{c2r} = T_{c2r} \cdot T_{r2c} = 1$ gives

    $\mathcal{H} = T_{c2r} \cdot \mathcal{F}_{r2c}$  and  $\mathcal{H} = \mathcal{F}_{c2r} \cdot T_{r2c}$    (25.5-4)


25.6 Higher radix FHT algorithms 


Higher radix FHT algorithms seem to get complicated due to the structure of the Hartley shift operator. 
In fact there is a straightforward way to turn any FFT decomposition into an FHT algorithm. 


For the moment assume that we want to compute a complex FHT, further assume we want to use a
radix-r algorithm. At each step we have r short FHTs and want to combine them to a longer FHT but
we do not know how this might be done. In section 25.3 on page 521 we learned how to turn an FHT
into an FFT using the T-operator. And we have seen radix-r algorithms for the FFT. The crucial idea
is to use the conversion operator T as a wrapper around the FFT-step that combines several short FFTs
into a longer one. Turn a radix-r FFT-step into an FHT-step as follows:

1. Convert the r short FHTs into FFTs (use T on the subsequences).
2. Do the radix-r FFT step.
3. Convert the FFT into an FHT (use T on the sequence).

For efficient implementations one obviously wants to combine the computations.

With a radix-r step the scheme always accesses 2r elements simultaneously. The symmetry of the
trigonometric factors is thereby automatically exploited. Splitting steps for the radix-4 FHT and the
split-radix FHT are given in [317].

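25.7 Convolution via FHT


The convolution property of the Hartley transform can be stated as

    $\mathcal{H}[a \circledast b] = \frac{1}{2} \left( \mathcal{H}[a]\, \mathcal{H}[b] - \overline{\mathcal{H}[a]}\; \overline{\mathcal{H}[b]} + \mathcal{H}[a]\, \overline{\mathcal{H}[b]} + \overline{\mathcal{H}[a]}\, \mathcal{H}[b] \right)$    (25.7-1)

or, with $c := \mathcal{H}[a]$ and $d := \mathcal{H}[b]$, written element-wise:

    $\mathcal{H}[a \circledast b]_k = \frac{1}{2} \left( c_k\, d_k - \overline{c}_k\, \overline{d}_k + c_k\, \overline{d}_k + \overline{c}_k\, d_k \right)$    (25.7-2a)

    $\qquad = \frac{1}{2} \left( c_k\, (d_k + \overline{d}_k) + \overline{c}_k\, (d_k - \overline{d}_k) \right)$    (25.7-2b)

    $\qquad = \frac{1}{2} \left( d_k\, (c_k + \overline{c}_k) + \overline{d}_k\, (c_k - \overline{c}_k) \right)$    (25.7-2c)

The latter forms reduce the number of multiplications. When turning the relation into an algorithm, one
has to keep in mind that both elements $y_k = \mathcal{H}[a \circledast b]_k$ and $y_{-k}$ must be computed simultaneously.

For the auto-convolution equation 25.7-2a becomes:

    $\mathcal{H}[a \circledast a]_k = \frac{1}{2} \left( c_k\, (c_k + \overline{c}_k) + \overline{c}_k\, (c_k - \overline{c}_k) \right)$    (25.7-3a)

    $\qquad = c_k\, \overline{c}_k + \frac{1}{2} \left( c_k^2 - \overline{c}_k^2 \right)$    (25.7-3b)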


25.7.1 Algorithms as pseudocode 


The following routine computes the cyclic convolution of two real-valued sequences x[] and y[] via the
FHT, the array length n must be even:

procedure fht_cyclic_convolution(x[], y[], n)
// real x[0..n-1] input, modified
// real y[0..n-1] result
{
    // transform data:
    fht(x[], n)
    fht(y[], n)

    // convolution in transformed domain:
    j := n-1
    for i:=1 to n/2-1
    {
        xi := x[i]
        xj := x[j]

        yp := y[i] + y[j]  // == y[j] + y[i]
        ym := y[i] - y[j]  // == -(y[j] - y[i])

        y[i] := (xi*yp + xj*ym)/2
        y[j] := (xj*yp - xi*ym)/2

        j := j-1
    }

    y[0] := x[0] * y[0]
    if n>1 then  y[n/2] := x[n/2] * y[n/2]

    // transform back:
    fht(y[], n)

    // normalize:
    for i:=0 to n-1
    {
        y[i] := y[i] / n
    }
}
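The pseudocode above can be tried out with a small C++ sketch. To keep it self-contained the FHT is again replaced by a slow O(n^2) routine (slow_fht(), an assumption of this example), and the result is compared with the convolution computed by definition.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> vec;

// Direct O(n^2) unnormalized Hartley transform (stand-in for a fast fht()).
vec slow_fht(const vec &a)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    vec c(n, 0.0);
    for (std::size_t k=0; k<n; ++k)
        for (std::size_t x=0; x<n; ++x)
        {
            const double phi = 2.0*pi*double((k*x) % n)/double(n);
            c[k] += a[x] * (std::cos(phi) + std::sin(phi));
        }
    return c;
}

// Cyclic convolution via the Hartley transform; the length must be even.
vec fht_cyclic_convolution(vec x, vec y)
{
    const std::size_t n = x.size();
    assert( n == y.size() && n >= 2 && n % 2 == 0 );

    x = slow_fht(x);
    y = slow_fht(y);

    // convolution in transformed domain, elements i and n-i together:
    for (std::size_t i=1, j=n-1; i<j; ++i, --j)
    {
        const double xi = x[i],  xj = x[j];
        const double yp = y[i] + y[j];
        const double ym = y[i] - y[j];
        y[i] = (xi*yp + xj*ym) * 0.5;
        y[j] = (xj*yp - xi*ym) * 0.5;
    }
    y[0]   *= x[0];
    y[n/2] *= x[n/2];

    y = slow_fht(y);                  // transform back
    for (double &v : y)  v /= double(n);  // normalize
    return y;
}

// Cyclic convolution by definition: z[(i+j) mod n] += x[i]*y[j].
vec direct_cyclic_convolution(const vec &x, const vec &y)
{
    const std::size_t n = x.size();
    vec z(n, 0.0);
    for (std::size_t i=0; i<n; ++i)
        for (std::size_t j=0; j<n; ++j)
            z[(i+j) % n] += x[i] * y[j];
    return z;
}
```

For x = (1, 2, 3, 4) and y = (5, 6, 7, 8) both routines should give (66, 68, 66, 60), up to rounding in the transform-based version.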


It is assumed that the procedure fht() does no normalization. A routine for the cyclic auto-convolution
is

procedure cyclic_self_convolution(x[], n)
// real x[0..n-1] input, result
{
    // transform data:
    fht(x[], n)

    // convolution in transformed domain:
    j := n-1
    for i:=1 to n/2-1
    {
        ci := x[i]
        cj := x[j]

        t1 := ci*cj                // == cj*ci
        t2 := 1/2*(ci*ci-cj*cj)    // == -1/2*(cj*cj-ci*ci)

        x[i] := t1 + t2
        x[j] := t1 - t2

        j := j-1
    }

    x[0] := x[0] * x[0]
    if n>1 then  x[n/2] := x[n/2] * x[n/2]

    // transform back:
    fht(x[], n)

    // normalize:
    for i:=0 to n-1
    {
        x[i] := x[i] / n
    }
}

For odd n replace the line
    for i:=1 to n/2-1
by
    for i:=1 to (n-1)/2
and omit the line
    if n>1 then x[n/2] := x[n/2]*x[n/2]
in both procedures above.


25.7.2 C++ implementations 
The FHT based routine for the cyclic convolution of two real sequences is [FXT: convolution/fhtcnvl.cc]:

void fht_convolution(double * restrict f, double * restrict g, ulong ldn)
{
    fht(f, ldn);
    fht(g, ldn);
    fht_convolution_core(f, g, ldn);
    fht(g, ldn);
}


The equivalent of the element-wise multiplication is given in [FXT: convolution/fhtcnvlcore.cc]:

void
fht_convolution_core(const double * restrict f, double * restrict g, ulong ldn,
                     double v/*=0.0*/)
// Auxiliary routine for the computation of convolutions
// via Fast Hartley Transforms.
// ldn := base-2 logarithm of the array length.
// v!=0.0 chooses alternative normalization.
{
    const ulong n = (1UL<<ldn);

    if ( v==0.0 )  v = 1.0/n;

    g[0] *= (v * f[0]);
    const ulong nh = n/2;

    if ( nh>0 )
    {
        g[nh] *= (v * f[nh]);
        v *= 0.5;
        for (ulong i=1,j=n-1; i<j; i++,j--)  fht_mul(f[i], f[j], g[i], g[j], v);
    }
}


The auxiliary function fht_mul() is [FXT: convolution/fhtmulsqr.h]:

template <typename Type>
static inline void
fht_mul(Type xi, Type xj, Type &yi, Type &yj, double v)
// yi <-- v*( (yi + yj)*xi + (yi - yj)*xj ) == v*( (xi + xj)*yi + (xi - xj)*yj )
// yj <-- v*( (-yi + yj)*xi + (yi + yj)*xj ) == v*( (-xi + xj)*yi + (xi + xj)*yj )
{
    Type h1p = xi, h1m = xj;
    Type s1 = h1p + h1m, d1 = h1p - h1m;
    Type h2p = yi, h2m = yj;
    yi = (h2p * s1 + h2m * d1) * v;
    yj = (h2m * s1 - h2p * d1) * v;
}
A C++ implementation of the FHT based self-convolution is given in [FXT: convolution/fhtcnvla.cc]. It
uses the routine [FXT: convolution/fhtcnvlacore.cc]:

void
fht_auto_convolution_core(double *f, ulong ldn,
                          double v/*=0.0*/)
// v!=0.0 chooses alternative normalization
{
    const ulong n = (1UL<<ldn);
    if ( v==0.0 )  v = 1.0/n;
    f[0] *= (v * f[0]);
    if ( n>=2 )
    {
        const ulong nh = n/2;
        f[nh] *= (v * f[nh]);
        v *= 0.5;
        for (ulong i=1,j=n-1; i<j; i++,j--)  fht_sqr(f[i], f[j], v);
    }
}

where [FXT: convolution/fhtmulsqr.h]:

template <typename Type>
static inline void
fht_sqr(Type &xi, Type &xj, double v)
// xi <-- v*( 2*xi*xj + xi*xi - xj*xj )
// xj <-- v*( 2*xi*xj - xi*xi + xj*xj )
{
    Type a = xi, b = xj;
    Type s1 = (a + b) * (a - b);
    a *= b;
    a += a;
    xi = (a+s1) * v;
    xj = (a-s1) * v;
}
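The squaring routine should give the same result as the general multiplication routine called with both operands equal. The following small sketch repeats the two templates and compares them; sqr_mul_mismatch() is a helper invented for this check.

```cpp
#include <cassert>
#include <cmath>

template <typename Type>
static inline void
fht_mul(Type xi, Type xj, Type &yi, Type &yj, double v)
// yi <-- v*( (xi + xj)*yi + (xi - xj)*yj )
// yj <-- v*( (-xi + xj)*yi + (xi + xj)*yj )
{
    Type s1 = xi + xj, d1 = xi - xj;
    Type h2p = yi, h2m = yj;
    yi = (h2p * s1 + h2m * d1) * v;
    yj = (h2m * s1 - h2p * d1) * v;
}

template <typename Type>
static inline void
fht_sqr(Type &xi, Type &xj, double v)
// xi <-- v*( 2*xi*xj + xi*xi - xj*xj )
// xj <-- v*( 2*xi*xj - xi*xi + xj*xj )
{
    Type a = xi, b = xj;
    Type s1 = (a + b) * (a - b);
    a *= b;
    a += a;
    xi = (a+s1) * v;
    xj = (a-s1) * v;
}

// Largest absolute difference between fht_sqr(x) and fht_mul(x, x).
double sqr_mul_mismatch(double xi, double xj, double v)
{
    double si = xi, sj = xj;
    fht_sqr(si, sj, v);

    double mi = xi, mj = xj;
    fht_mul(xi, xj, mi, mj, v);

    return std::fmax(std::fabs(si-mi), std::fabs(sj-mj));
}
```

With xi = 3, xj = -2 and v = 1/2 both routines give the pair (-3.5, -8.5); the squaring version needs two multiplications instead of four (not counting those by v).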


25.7.3 Avoiding the revbin permutations 


The observation that the revbin permutations can be omitted with FFT-based convolutions (see
section 22.1.3 on page 442) applies again [FXT: convolution/fhtcnvlcore.cc]:

void
fht_convolution_revbin_permuted_core(const double * restrict f,
                                     double * restrict g,
                                     ulong ldn,
                                     double v/*=0.0*/)
// Same as fht_convolution_core() but with data access in revbin order.
{
    const ulong n = (1UL<<ldn);
    if ( v==0.0 )  v = 1.0/n;

    g[0] *= (v * f[0]);  // 0 == revbin(0)
    if ( n>=2 )  g[1] *= (v * f[1]);  // 1 == revbin(nh)

    if ( n<4 )  return;

    v *= 0.5;
    const ulong nh = (n>>1);

    ulong r=nh, rm=n-1;  // nh == revbin(1),  n-1 == revbin(n-1)
    fht_mul(f[r], f[rm], g[r], g[rm], v);

    ulong k=2, km=n-2;
    while ( k<nh )
    {
        // k even:
        rm -= nh;
        ulong tr = r;
        r^=nh;  for (ulong m=(nh>>1); !((r^=m)&m); m>>=1)  {;}
        fht_mul(f[r], f[rm], g[r], g[rm], v);
        --km;
        ++k;

        // k odd:
        rm += (tr-r);
        r += nh;
        fht_mul(f[r], f[rm], g[r], g[rm], v);
        --km;
        ++k;
    }
}


The optimized version saving three revbin permutations is [FXT: convolution/fhtcnvl.cc]:

void fht_convolution(double * restrict f, double * restrict g, ulong ldn)
{
    fht_dif_core(f, ldn);
    fht_dif_core(g, ldn);
    fht_convolution_revbin_permuted_core(f, g, ldn);
    fht_dit_core(g, ldn);
}


25.7.4 Negacyclic convolution via FHT 


Pseudocode for the computation of the negacyclic auto-convolution via FHT:

procedure negacyclic_self_convolution(x[], n)
// real x[0..n-1] input, result
{
    hartley_shift_05(x, n)  // preprocess
    fht(x, n)  // transform data

    // convolution in transformed domain:
    j := n-1
    for i:=0 to n/2-1  // here i starts from zero
    {
        a := x[i]
        b := x[j]

        x[i] := a*b+(a*a-b*b)/2
        x[j] := a*b-(a*a-b*b)/2
        j := j-1
    }

    fht(x, n)  // transform back
    hartley_shift_05(x, n)  // postprocess
}

C++ implementations for the negacyclic convolution and self-convolution are given in
[FXT: convolution/fhtnegacnvl.cc]. The negacyclic convolution is used for the computation of weighted
transforms, for example in the MFA-based convolution for real input described in section 22.5.4 on page 453.
25.8 Localized FHT algorithms


Localized routines for the FHT can be obtained by slight modifications of the corresponding algorithms
for the Walsh transform described in section 23.5 on page 468. The decimation in time (DIT) version is
[FXT: fht/fhtloc2.h]:

template <typename Type>
void fht_loc_dit2_core(Type *f, ulong ldn)
{
    if ( ldn<=13 )  // sizeof(Type)*(2**threshold) <= L1_CACHE_BYTES
    {
        fht_dit_core(f, ldn);
        return;
    }

    // Recursion:
    fht_dit_core_2(f+2);  // ldm==1
    fht_dit_core_4(f+4);  // ldm==2
    fht_dit_core_8(f+8);  // ldm==3
    for (ulong ldm=4; ldm<ldn; ++ldm)  fht_loc_dit2_core(f+(1UL<<ldm), ldm);

    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        hartley_shift_05(f+mh, mh);
        for (ulong t1=0, t2=mh; t1<mh; ++t1, ++t2)  sumdiff(f[t1], f[t2]);
    }
}

The routine hartley_shift_05() is described in section 25.2.1 on page 515. Choose an implementation that
uses trigonometric recursion as this improves performance considerably.


The decimation in frequency (DIF) version is:

template <typename Type>
void fht_loc_dif2_core(Type *f, ulong ldn)
{
    if ( ldn<=13 )  // sizeof(Type)*(2**threshold) <= L1_CACHE_BYTES
    {
        fht_dif_core(f, ldn);
        return;
    }

    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong t1=0, t2=mh; t1<mh; ++t1, ++t2)  sumdiff(f[t1], f[t2]);
        hartley_shift_05(f+mh, mh);
    }

    // Recursion:
    fht_dif_core_2(f+2);  // ldm==1
    fht_dif_core_4(f+4);  // ldm==2
    fht_dif_core_8(f+8);  // ldm==3
    for (ulong ldm=4; ldm<ldn; ++ldm)  fht_loc_dif2_core(f+(1UL<<ldm), ldm);
}


The (generated) short-length transforms are given in the files [FXT: fht/shortfhtdifcore.h] and
[FXT: fht/shortfhtditcore.h]. For example, the length-8 decimation in frequency routine is

template <typename Type>
inline void
fht_dif_core_8(Type *f)
{
    Type g0, f0, f1, g1;
    sumdiff(f[0], f[4], f0, g0);
    sumdiff(f[2], f[6], f1, g1);
    sumdiff(f0, f1);
    sumdiff(g0, g1);
    Type s1, c1, s2, c2;
    sumdiff(f[1], f[5], s1, c1);
    sumdiff(f[3], f[7], s2, c2);
    sumdiff(s1, s2);
    sumdiff(f0, s1, f[0], f[1]);
    sumdiff(f1, s2, f[2], f[3]);
    c1 *= SQRT2;
    c2 *= SQRT2;
    sumdiff(g0, c1, f[4], f[5]);
    sumdiff(g1, c2, f[6], f[7]);
}


An additional revbin permutation is needed if the data is required in order. The FHT can be computed
by either

    fht_loc_dif2_core(f, ldn);
    revbin_permute(f, 1UL<<ldn);

or

    revbin_permute(f, 1UL<<ldn);
    fht_loc_dit2_core(f, ldn);

Performance for large arrays is excellent: the convolutions based on the transforms
[FXT: convolution/fhtloccnvl.cc]

void
loc_fht_convolution(double * restrict f, double * restrict g, ulong ldn)
{
    fht_loc_dif2_core(f, ldn);
    fht_loc_dif2_core(g, ldn);
    fht_convolution_revbin_permuted_core(f, g, ldn);
    fht_loc_dit2_core(g, ldn);
}

and [FXT: convolution/fhtloccnvla.cc]

void
loc_fht_auto_convolution(double *f, ulong ldn)
{
    fht_loc_dif2_core(f, ldn);
    fht_auto_convolution_revbin_permuted_core(f, ldn);
    fht_loc_dit2_core(f, ldn);
}

gave a significant (more than 50 percent) speedup for the high precision multiplication routines (see
section 28.2 on page 558) used in the hfloat library [22].

25.9 2-dimensional FHTs


A 2-dimensional FHT can be computed almost as easily as a 2-dimensional FFT, only a simple additional
step is needed. Start with the row-column algorithm described in section 21.9.1 on page 437
[FXT: fht/twodimfht.cc]:

void
row_column_fht(double *f, ulong nr, ulong nc)
// FHT over rows and columns.
// nr := number of rows
// nc := number of columns
{
    ulong n = nr * nc;

    // fht over rows:
    ulong ldc = ld(nc);
    for (ulong k=0; k<n; k+=nc)  fht(f+k, ldc);

    // fht over columns:
    double *w = new double[nr];
    for (ulong k=0; k<nc; k++)  skip_fht(f+k, nr, nc, w);
    delete [] w;
}


No attempt has been made to make the routine cache friendly: the routine skip_fht() [FXT:
fht/skipfht.cc] simply copies a column into the temporary array, computes the FHT and copies the
data back. [72]: 


This is not yet a 2-dimensional FHT, the following post-processing must be made:

    void
    y_transform(double *f, ulong nr, ulong nc)
    // Transforms row-column-FHT to 2-dimensional FHT.
    // Self-inverse.
    {
        ulong rh = nr/2;
        if ( nr&1 )  rh++;

        ulong ch = nc/2;
        if ( nc&1 )  ch++;

        ulong n = nr*nc;
        for (ulong tr=1, ctr=nc;  tr<rh;  tr++,ctr+=nc)  // ctr=nc*tr
        {
            double *pa = f + ctr;
            double *pb = pa + nc;
            double *pc = f + n - ctr;
            double *pd = pc + nc;
            for (ulong tc=1; tc<ch; tc++)
            {
                pa++;  pb--;  pc++;  pd--;
                double e = (*pa + *pd - *pb - *pc) * 0.5;
                *pa -= e;  *pb += e;  *pc += e;  *pd -= e;
            }
        }
    }

The canned routine for the 2-dimensional FHT is 


void 
twodim_fht(double *f, ulong nr, ulong nc) 
{ 
    row_column_fht(f, nr, nc);
y_transform(f, nr, nc); 
} 


25.10 Automatic generation of transform code 


FFT generators are programs that output FFT routines, usually for short lengths. The considerations 
given here are not restricted to FFT codes. However, routines that can be unrolled like those for fast 
transforms, matrix multiplication, or convolution are prime candidates for automated generation. 


Algorithmic knowledge can be built into code generators, but we restrict our attention to a simpler 
method known as partial evaluation. Writing such a program is easy: take an existing FFT and change 
all computations into print statements that emit the necessary code. The process, however, is less than 
delightful and very error-prone. 


It would be much better to have a program that reads the existing FFT code as input and writes the 
code for the generator. Let us call this a meta-generator. Implementing such a meta-generator is highly 
nontrivial. It requires writing a parser for the used language, and also data flow analysis. A practical 
compromise is a program that, while theoretically not even close to a meta-generator, creates output that 
is a usable generator code. 


One should print the current values of the loop variables of the original code as comments at the beginning 
of a block. That way it is possible to identify the corresponding parts of the generated code and the 
original file. In addition, one may keep the comments of the original code. 
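
As a minimal sketch of the "computations become print statements" idea (this is not the FXT generator), the following routine emits unrolled code for a radix-2 butterfly network of length 2^ldn. The function name gen_walsh_code() and the choice of a Walsh-type transform are assumptions made for illustration; the emitted code calls sumdiff() as in the listings of this chapter. Each emitted block starts with a comment that records the loop variables of the generating loop.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical generator: returns unrolled code for a length-2^ldn
// radix-2 butterfly network (Walsh transform style).
// Every computation of the original loop is replaced by a print statement.
std::string gen_walsh_code(unsigned long ldn)
{
    std::ostringstream out;
    const unsigned long n = 1UL << ldn;
    for (unsigned long ldm = 1; ldm <= ldn; ++ldm)
    {
        const unsigned long m = 1UL << ldm, mh = m >> 1;
        for (unsigned long r = 0; r < n; r += m)
        {
            // loop variables of the original code, printed as a comment:
            out << "{ // ldm=" << ldm << " r=" << r << "\n";
            for (unsigned long j = 0; j < mh; ++j)
            {
                // original statement: sumdiff(f[r+j], f[r+j+mh]);
                out << "    sumdiff(f[" << (r + j) << "], f["
                    << (r + j + mh) << "]);\n";
            }
            out << "}\n";
        }
    }
    return out.str();
}
```

Calling gen_walsh_code(2) returns four sumdiff() lines in three commented blocks; the comments make it easy to locate the loop iteration that produced each block.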


With FFTs it may be necessary to identify the trigonometric values that occur in the process in terms of
the corresponding sine and cosine arguments as rational multiples of π. These values should be inlined
to some greater precision than actually needed to avoid the generation of multiple copies with differences 
only due to numeric inaccuracies. Printing the arguments, both as they appear and in lowest terms, 
inside comments helps to understand and further optimize the generated code: 


double c1=.980785280403230449126182236134; // == cos(Pi*1/16) == cos(Pi*1/16) 
double s1=.195090322016128267848284868476; // == sin(Pi*1/16) == sin(Pi*1/16) 
double c2=.923879532511286756128183189397; // == cos(Pi*2/16) == cos(Pi*1/8) 
double s2=.382683432365089771728459984029; // == sin(Pi*2/16) == sin(Pi*1/8) 


Automatic verification of the generated codes against the original is a mandatory part of the process. 


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A level of abstraction for the array indices is of great use: when the print statements in the generator 
emit some function of the index instead of its plain value, it is easy to generate modified versions of the 
code for permuted input. That is, instead of 


    cout << "sumdiff(f0, f2, g[" << k0 << "], g[" << k2 << "]);" << endl;
    cout << "sumdiff(f1, f3, g[" << k1 << "], g[" << k3 << "]);" << endl;

use

    cout << "sumdiff(f0, f2, " << idxf(g,k0) << ", " << idxf(g,k2) << ");" << endl;
    cout << "sumdiff(f1, f3, " << idxf(g,k1) << ", " << idxf(g,k3) << ");" << endl;


where idxf(g, k) can be defined to print a modified (for example, revbin-permuted) index k. 
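
One possible shape of such an index function is sketched below. The signature used here (array name as a string, plus the parameters ldn and permuted) is an assumption for illustration and differs from the two-argument idxf(g, k) used above; emit_line() is likewise a hypothetical helper.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Reverse the lowest ldn bits of k.
unsigned long revbin(unsigned long k, unsigned long ldn)
{
    unsigned long r = 0;
    for (unsigned long i = 0; i < ldn; ++i, k >>= 1)  r = (r << 1) | (k & 1);
    return r;
}

// Hypothetical index function: text for element k of the array 'name',
// with the index optionally revbin-permuted.
std::string idxf(const std::string &name, unsigned long k,
                 unsigned long ldn, bool permuted)
{
    std::ostringstream s;
    s << name << "[" << (permuted ? revbin(k, ldn) : k) << "]";
    return s.str();
}

// Emit one generated line; the same generator serves both index orders.
std::string emit_line(unsigned long k0, unsigned long k2,
                      unsigned long ldn, bool permuted)
{
    return "sumdiff(f0, f2, " + idxf("g", k0, ldn, permuted) + ", "
         + idxf("g", k2, ldn, permuted) + ");";
}
```

With ldn=3 the call emit_line(1, 3, 3, false) gives the plain indices 1 and 3, while emit_line(1, 3, 3, true) gives the bit-reversed indices 4 and 6. Only the index function changes, the generator itself stays untouched.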


Here is a generated length-8 DIT FHT core as an example [FXT: fht/shortfhtditcore.h]:

    template <typename Type>
    inline void fht_dit_core_8(Type *f)
    // unrolled version for length 8
    {
        { // start initial loop
            { // fi = 0  gi = 1
                Type g0, f0, f1, g1;
                sumdiff(f[0], f[1], f0, g0);
                sumdiff(f[2], f[3], f1, g1);
                sumdiff(f0, f1);
                sumdiff(g0, g1);
                Type s1, c1, s2, c2;
                sumdiff(f[4], f[5], s1, c1);
                sumdiff(f[6], f[7], s2, c2);
                sumdiff(s1, s2);
                sumdiff(f0, s1, f[0], f[4]);
                sumdiff(f1, s2, f[2], f[6]);
                c1 *= SQRT2;
                c2 *= SQRT2;
                sumdiff(g0, c1, f[1], f[5]);
                sumdiff(g1, c2, f[3], f[7]);
            }
        } // end initial loop
    }
    // opcount by generator:  #mult=2=0.25/pt   #add=22=2.75/pt


Generated DIF FHT codes for lengths up to 64 are given in [FXT: fht/shortfhtdifcore.h].


The generated codes can be useful to spot parts of the original code that allow further optimization. 
Especially repeated trigonometric values and unused symmetries tend to be apparent in the unrolled 
code. 


It is a good idea to let the generator count the number of operations (multiplications, additions, loads 
and stores) of the code it emits. Those numbers can be compared to the corresponding values found in 
the compiled assembler code. 


The GCC compiler can produce the assembler code with the original source interlaced. This is a great 
tool for code optimization. The necessary commands are (include and warning flags omitted) 


    # create assembler code:
    c++ -S -fverbose-asm -g -O2 test.cc -o test.s

    # create asm interlaced with source lines:
    as -alhnd test.s > test.lst


For example, the generated length-4 DIT FHT core from [FXT: fht/shortfhtditcore.h] is

    template <typename Type>
    inline void fht_dit_core_4(Type *f)
    // unrolled version for length 4
    {
        Type f0, f1, f2, f3;
        sumdiff(f[0], f[1], f0, f1);
        sumdiff(f[2], f[3], f2, f3);
        sumdiff(f0, f2, f[0], f[2]);
        sumdiff(f1, f3, f[1], f[3]);
    }


With Type set to double the generated assembler is, after some editing for readability, 


    void fht_dit_core_4(double *f)
    {
        double f0, f1, f2, f3;
        sumdiff(f[0], f[1], f0, f1);
      movlpd (%rdi), %xmm1     #* f, tmp63
      movlpd 8(%rdi), %xmm0    #, tmp64
        sumdiff(f[2], f[3], f2, f3);
      movlpd 16(%rdi), %xmm2   #, tmp67
      movsd  %xmm1, %xmm3      # tmp63, f0
      subsd  %xmm0, %xmm1      # tmp64, f1
      movsd  %xmm2, %xmm4      # tmp67, f2
      addsd  %xmm0, %xmm3      # tmp64, f0
      movlpd 24(%rdi), %xmm0   #, tmp68
      addsd  %xmm0, %xmm4      # tmp68, f2
      subsd  %xmm0, %xmm2      # tmp68, f3
        sumdiff(f0, f2, f[0], f[2]);
      movsd  %xmm3, %xmm0      # f0, tmp71
      addsd  %xmm4, %xmm0      # f2, tmp71
      subsd  %xmm4, %xmm3      # f2, f0
      movsd  %xmm0, (%rdi)     # tmp71,* f
        sumdiff(f1, f3, f[1], f[3]);
      movsd  %xmm1, %xmm0      # f1, tmp73
      subsd  %xmm2, %xmm1      # f3, f1
      movsd  %xmm3, 16(%rdi)   # f0,
      addsd  %xmm2, %xmm0      # f3, tmp73
      movsd  %xmm1, 24(%rdi)   # f1,
      movsd  %xmm0, 8(%rdi)    # tmp73,
    }


Note that the assembler code is not always in sync with the corresponding source lines, especially with 
higher levels of optimization. 


25.11  Eigenvectors of the Fourier and Hartley transform 1 
Let $a_S := a + \bar{a}$ be the symmetric part of a sequence $a$, then

    $F[F[a_S]] = a_S$                                                      (25.11-1)

Now let $u_+ := a_S + F[a_S]$ and $u_- := a_S - F[a_S]$, then

    $F[u_+] = F[a_S] + a_S = a_S + F[a_S] = +1 \cdot u_+$                  (25.11-2a)
    $F[u_-] = F[a_S] - a_S = -(a_S - F[a_S]) = -1 \cdot u_-$               (25.11-2b)

Both $u_+$ and $u_-$ are symmetric. For $a_A := a - \bar{a}$, the antisymmetric part of $a$, we have

    $F[F[a_A]] = -a_A$                                                     (25.11-3)

Therefore with $v_+ := a_A + i\,F[a_A]$ and $v_- := a_A - i\,F[a_A]$:

    $F[v_+] = F[a_A] - i\,a_A = -i\,(a_A + i\,F[a_A]) = -i \cdot v_+$      (25.11-4a)
    $F[v_-] = F[a_A] + i\,a_A = +i\,(a_A - i\,F[a_A]) = +i \cdot v_-$      (25.11-4b)

Both $v_+$ and $v_-$ are antisymmetric. The sequences $u_+$, $u_-$, $v_+$, and $v_-$ are eigenvectors of the Fourier
transform, with eigenvalues $+1$, $-1$, $-i$ and $+i$ respectively. The eigenvectors are pair-wise orthogonal.
Using the relation

    $a = \frac{1}{4}\,(u_+ + u_- + v_+ + v_-)$                             (25.11-5)

we can, for a given sequence, find a transform that is a 'square root' of the Fourier transform: compute
$u_+$, $u_-$, $v_+$, and $v_-$, and a transform $F^{\lambda}[a]$ for $\lambda \in \mathbb{R}$ as

    $F^{\lambda}[a] = \frac{1}{4}\,\left( (+1)^{\lambda} u_+ + (-1)^{\lambda} u_- + (-i)^{\lambda} v_+ + (+i)^{\lambda} v_- \right)$    (25.11-6)

This transform is called the fractional (order) Fourier transform (but see section 22.6.3 on page 456).
Then $F^{0}[a]$ is the identity and $F^{1}[a]$ is the usual Fourier transform. The transform $F^{1/2}[a]$ is a transform
so that $F^{1/2}[F^{1/2}[a]] = F[a]$, that is, a 'square root' of the Fourier transform. The transform $F^{1/2}[a]$
is not unique as the expressions $(\pm 1)^{1/2}$ and $(\pm i)^{1/2}$ are not.

A set of eigenvectors (that is, eigenfunctions) of the continuous Fourier transform is given by

    $H_n(x)\,\exp(-x^2/2)$                                                 (25.11-7)

where $H_n$ is the n-th Hermite polynomial, see the figure on page 696. The corresponding eigenvalues
are $i^n$. The functions are the eigenstates of the quantum mechanical harmonic oscillator, see [358, entry
"Quantum oscillator"].


The eigenvectors of the Hartley transform are

    $u_+ := a + H[a]$                                                      (25.11-8a)
    $u_- := a - H[a]$                                                      (25.11-8b)

The eigenvalues are $\pm 1$, we have $H[u_+] = +1 \cdot u_+$ and $H[u_-] = -1 \cdot u_-$.
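
The eigenvector relations for the Hartley transform can be checked numerically. The sketch below uses a plain O(n^2) Hartley transform with normalization 1/sqrt(n) (so that the transform is its own inverse); it is an illustration only and not one of the fast routines of this chapter. The names slow_hartley(), hartley_eigenvector() and eigen_defect() are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// O(n^2) normalized Hartley transform (illustration only):
// H[a]_k = 1/sqrt(n) * sum_x a_x * (cos(2*pi*k*x/n) + sin(2*pi*k*x/n))
std::vector<double> slow_hartley(const std::vector<double> &a)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    std::vector<double> h(n, 0.0);
    for (std::size_t k = 0; k < n; ++k)
    {
        for (std::size_t x = 0; x < n; ++x)
        {
            const double phi = 2.0 * pi * double((k * x) % n) / double(n);
            h[k] += a[x] * (std::cos(phi) + std::sin(phi));
        }
        h[k] /= std::sqrt(double(n));
    }
    return h;
}

// Returns u_+ = a + H[a] (sign=+1) or u_- = a - H[a] (sign=-1).
std::vector<double> hartley_eigenvector(const std::vector<double> &a, int sign)
{
    std::vector<double> u = slow_hartley(a);
    for (std::size_t k = 0; k < a.size(); ++k)  u[k] = a[k] + sign * u[k];
    return u;
}

// Largest absolute deviation of H[u] from lambda*u.
double eigen_defect(const std::vector<double> &u, double lambda)
{
    const std::vector<double> hu = slow_hartley(u);
    double d = 0.0;
    for (std::size_t k = 0; k < u.size(); ++k)
        d = std::fmax(d, std::fabs(hu[k] - lambda * u[k]));
    return d;
}
```

For any input sequence, eigen_defect(hartley_eigenvector(a, +1), +1.0) and eigen_defect(hartley_eigenvector(a, -1), -1.0) should both be zero up to rounding errors.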


Let $M$ be the $n \times n$ matrix corresponding to the length-n Fourier transform with $\sigma = +1$, that is,
$M_{r,c} = \frac{1}{\sqrt{n}} \exp(2\pi i\, r c/n)$. Then its characteristic polynomial (see relation 42.5-2 on page 899) is

    $p(x) = (x-1)^{\lfloor (n+4)/4 \rfloor}\,(x+1)^{\lfloor (n+2)/4 \rfloor}\,(x-i)^{\lfloor (n+1)/4 \rfloor}\,(x+i)^{\lfloor (n-1)/4 \rfloor}$    (25.11-9)

We write $p(x) = x^n + c_{n-1}\,x^{n-1} + \ldots + c_1\,x + c_0$. The trace of the matrix $M$ is

    $\mathrm{Tr}(M) = \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} \exp(2\pi i\, k^2/n)$    (25.11-10)

It equals ($-c_{n-1}$, the negated sum of all roots of $p(x)$, and)

    $1+i,\ +1,\ 0,\ +i$                                                    (25.11-11)

for $n \bmod 4 = 0, 1, 2, 3$, respectively. A closed form is $(1 + i^{-n})/(1 - i)$. The generating function for
the sequence is $((1+i) - x)/(1 + (-1+i)\,x - i\,x^2)$.

The determinant of $M$ equals ($(-1)^n c_0$, that is, the product of all roots of $p(x)$, and)

    $+i,\ +1,\ -1,\ -i,\ -i,\ -1,\ +1,\ +i$                                (25.11-12)

for $n \bmod 8 = 0, 1, 2, \ldots, 7$. The generating function for the sequence is $(i + x - x^2 - i\,x^3)/(1 + x^4)$.

Let $H$ be the $n \times n$ matrix corresponding to the length-n Hartley transform, that is,
$H_{r,c} = \frac{1}{\sqrt{n}} \left( \cos(2\pi\, r c/n) + \sin(2\pi\, r c/n) \right)$. Then its characteristic polynomial is


pla) = (x— per) (a+ 1j 8-279] (25.11-13) 
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Chapter 26 


Number theoretic transforms 
(NTTs) 


We introduce the number theoretic transforms (NTTs). The routines for the fast NTTs are rather
straightforward translations of the FFT algorithms. Radix-2 and radix-4 routines are given; there should
be no difficulty in translating any given complex FFT into the equivalent NTT. For the translation of
real-valued FFT (or FHT) routines, we need to express sines and cosines in modular arithmetic; this is
presented in sections 39.12.6 and 39.12.7.


As no rounding errors occur with the underlying modular arithmetic, the main application of NTTs is 
the fast computation of exact convolutions. 


26.1 Prime moduli for NTTs 


We want to implement FFTs in Z/mZ (the ring of integers modulo some integer m) instead of C, the 
field of complex numbers. These FFTs are called number theoretic transforms (NTTs), mod m FFTs or 
(if m is a prime) prime modulus transforms. 


There is a restriction for the choice of m: for a length-n NTT we need a primitive n-th root of unity. A
number r is called an n-th root of unity if $r^n = 1$. It is called a primitive n-th root if $r^k \neq 1$ for all $0 < k < n$
(see page 774).

In $\mathbb{C}$ matters are simple: $e^{\pm 2\pi i/n}$ is a primitive n-th root of unity for arbitrary n. For example, $e^{2\pi i/21}$
is a primitive 21st root of unity. Now $r = e^{2\pi i/3}$ is also a 21st root of unity but not a primitive root,
because $r^3 = 1$. A primitive n-th root of 1 in Z/mZ is also called an element of order n. The 'cyclic'
property of the elements r of order n lies at the heart of all FFT algorithms: $r^{n+k} = r^k$.

In Z/mZ things are not that simple: for a given modulus m primitive n-th roots of unity do not exist for
arbitrary n. They only exist for some maximal order R and its divisors $d_i$: $r^{R/d_i}$ is a $d_i$-th root of unity
because $(r^{R/d_i})^{d_i} = r^R = 1$. Therefore n, the length of the transform, must divide the maximal order R.
This is the first condition for NTTs:

    $n \mid R$                                                             (26.1-1)


The operations needed in FFTs are modular addition, subtraction and multiplication. Division is not
needed, except for the division by n in the final normalization. Division by n is multiplication by the
inverse of n, so n must be invertible in Z/mZ.

Therefore n, the length of the transform, must be coprime to the modulus m. This is the second condition
for NTTs:

    $\gcd(n, m) = 1$                                                       (26.1-2)
We restrict our attention to prime moduli, though NTTs are also possible with composite moduli. If
the modulus is a prime p, then Z/pZ is the field $F_p = GF(p)$: all elements except 0 have inverses and
'division is possible'. Thus the second condition (relation 26.1-2) is trivially fulfilled for all NTT lengths
n < p: a prime p is coprime to all integers n < p.

Roots of unity are available for the maximal order R = p − 1 and its divisors. Therefore the first condition
(relation 26.1-1) is that n divides p − 1. This restricts the choice for p to primes of the form $p = v\,n + 1$:
for NTTs of length $n = 2^k$ one will use primes like $p = 3 \cdot 5 \cdot 2^{27} + 1$ (31 bits), $p = 13 \cdot 2^{28} + 1$ (32 bits),
$p = 3 \cdot 29 \cdot 2^{56} + 1$ (63 bits) or $p = 27 \cdot 2^{59} + 1$ (64 bits).
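
The two conditions can be made concrete with a small search for an element of order n modulo a prime p. This is a sketch under the stated assumptions (p prime and below 2^32, n a power of 2), not the method used by the mod class of the library; the names pow_mod() and element_of_order_2pow() are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;

// Modular exponentiation b^e mod m; assumes m < 2^32 so that
// products of residues fit into 64 bits.
u64 pow_mod(u64 b, u64 e, u64 m)
{
    u64 r = 1;
    b %= m;
    while (e)
    {
        if (e & 1)  r = (r * b) % m;
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

// Returns an element of order n modulo the prime p, where n >= 2 is a
// power of 2. Returns 0 if n does not divide p-1 (first condition violated).
// Assumption: p is prime and p < 2^32.
u64 element_of_order_2pow(u64 n, u64 p)
{
    if ((p - 1) % n != 0)  return 0;
    for (u64 g = 2; g < p; ++g)
    {
        const u64 r = pow_mod(g, (p - 1) / n, p);
        // r^n = g^(p-1) = 1 always holds; because n is a power of 2,
        // the order of r is exactly n if and only if r^(n/2) != 1.
        if (pow_mod(r, n / 2, p) != 1)  return r;
    }
    return 0;
}
```

For the toy modulus p = 17 and n = 8 the search returns 9; for p = 3·5·2^27 + 1 = 2013265921 it finds elements of every order 2^k with k ≤ 27, and it returns 0 for n = 2^28 because 2^28 does not divide p − 1.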


    arg 1: 62 == wb  [word bits, wb<=63]  default=62
    arg 2: 0.01 == deltab  [results are in the range [wb-deltab, wb]]  default=0.01
      minb = 61.99 = wb-0.01
    arg 3: 44 == minx  [log_2(min(fftlen))]  default=44
    ---- x = 44: -----
    4580495072570638337 = 0x3f91300000000001 = 1 + 2^44 * 83 * 3137  (61.9902 bits)
    4581058022524059649 = 0x3f93300000000001 = 1 + 2^44 * 3 * 11 * 13 * 607  (61.9904 bits)
    4582113553686724609 = 0x3f96f00000000001 = 1 + 2^44 * 3 * 7 * 79 * 157  (61.9907 bits)
    4585702359639785473 = 0x3fa3b00000000001 = 1 + 2^44 * 3^2 * 11 * 2633  (61.9918 bits)
    4587039365779161089 = 0x3fa8700000000001 = 1 + 2^44 * 7 * 193^2  (61.9923 bits)
    4587391209500049409 = 0x3fa9b00000000001 = 1 + 2^44 * 3 * 17 * 5113  (61.9924 bits)
    4588130081313914881 = 0x3fac500000000001 = 1 + 2^44 * 3 * 5 * 17387  (61.9926 bits)
    4589572640569556993 = 0x3fb1700000000001 = 1 + 2^44 * 11 * 37 * 641  (61.9931 bits)
    [--snip--]
    4610999923171655681 = 0x3ffd900000000001 = 1 + 2^44 * 5 * 19 * 31 * 89  (61.9998 bits)
    4611105476287922177 = 0x3ffdf00000000001 = 1 + 2^44 * 262111  (61.9998 bits)
    ---- x = 45: -----
    4580336742896238593 = 0x3f90a00000000001 = 1 + 2^45 * 29 * 67^2  (61.9902 bits)
    4581533011547258881 = 0x3f94e00000000001 = 1 + 2^45 * 3 * 5 * 8681  (61.9905 bits)
    4584347761314365441 = 0x3f9ee00000000001 = 1 + 2^45 * 5 * 11 * 23 * 103  (61.9914 bits)
    4587655092290715649 = 0x3faaa00000000001 = 1 + 2^45 * 3 * 7^2 * 887  (61.9925 bits)
    [--snip--]
    ---- x = 48: -----
    4585508845593296897 = 0x3fa3000000000001 = 1 + 2^48 * 11 * 1481  (61.9918 bits)
    ---- x = 49: -----
    4582975570802900993 = 0x3f9a000000000001 = 1 + 2^49 * 7 * 1163  (61.991 bits)
    4595360469778169857 = 0x3fc6000000000001 = 1 + 2^49 * 3^2 * 907  (61.9949 bits)
    ---- x = 50: -----
    4601552919265804289 = 0x3fdc000000000001 = 1 + 2^50 * 61 * 67  (61.9968 bits)

Figure 26.1-A: Primes suitable for NTTs of lengths dividing 2^44.


    modulus (hex)       == factorization + 1              log(m-1)/log(2)
    0x3f40f80000000001  == 2^43.3^2.5^2.7^2.47+1          61.9831
    0x3c0eb50000000001  == 2^40.3^3.5^2.7^3.17+1          61.9083
    0x3d673d0000000001  == 2^40.3^2.5^3.7^2.73+1          61.9402
    0x3fc22b0000000001  == 2^40.3^2.5^2.7^2.379+1         61.9945
    0x3bf6190000000001  == 2^40.3^2.5^3.7.499+1           61.906
    0x3d1d690000000001  == 2^40.3^2.5^2.7.2543+1          61.9335
    0x3d8c270000000001  == 2^40.3^2.5^2.7.13.197+1        61.9436
    0x3e8e8d0000000001  == 2^40.3^2.5^2.7.19.137+1        61.9671
    0x3ee4af0000000001  == 2^40.3^2.5^2.7.2617+1          61.9748
    0x3ed23a0000000001  == 2^41.3^2.5^2.7.1307+1          61.9732
    0x3fafb60000000001  == 2^41.3^2.5^4.7.53+1            61.9929
    0x3c46140000000001  == 2^42.3^3.5^2.7.11.19+1         61.9135
    0x3e32440000000001  == 2^42.3^2.5^2.7.647+1           61.9588
    0x3d23900000000001  == 2^44.3^3.5^2.7.53+1            61.934

Figure 26.1-B: Primes suitable for NTTs of lengths dividing 2^40 · 3^2 · 5^2 · 7.


Primes suitable for NTTs (sometimes called FFT-primes) can be generated with the program [FXT:
mod/fftprimes-demo.cc]. A shortened sample output is shown in figure 26.1-A. A few moduli that allow
for transforms of lengths dividing 2^40 · 3^2 · 5^2 · 7 are shown in figure 26.1-B; the data is taken from [FXT:
mod/moduli.txt]. We note that primality of moduli suitable for NTTs can easily be tested using Proth's
theorem, see section 39.11.3.1.
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26.2 Implementation of NTTs 


To implement NTTs (modulo m, length n), we need to implement modular arithmetic and replace
$e^{\pm 2\pi i/n}$ by a primitive n-th root r of unity in Z/mZ in the code. A C++ class implementing modular
arithmetic is [FXT: class mod in mod/mod.h].

For the inverse transform one uses the (mod m) inverse $r^{-1}$ of r that was used for the forward transform.
The element $r^{-1}$ is also a primitive n-th root. Methods for the computation of the modular inverse are
described in section 39.1.4 (GCD algorithm) and in section 39.7.4 (powering algorithm).


While the notion of the Fourier transform as a 'decomposition into frequencies' appears to be meaningless
for NTTs, the algorithms are denoted with 'decimation in time/frequency' in analogy to those in the
complex domain.

The nice feature of NTTs is that, unlike with the floating-point FFTs, there is no loss of precision in the
transform. Using the trigonometric recursion in its most naive form is mandatory, as the computation of
roots of unity is expensive.


26.2.1 Radix-2 DIT NTT 


Pseudocode for the radix-2 decimation in time (DIT) NTT (to be called with ldn=log2(n)):

    procedure mod_fft_dit2(f[], ldn, is)
    // mod_type f[0..2**ldn-1]
    {
        n := 2**ldn
        rn := element_of_order(n)  // (mod_type)
        if is<0 then rn := rn**(-1)
        revbin_permute(f[], n)
        for ldm:=1 to ldn
        {
            m := 2**ldm
            mh := m/2
            dw := rn**(2**(ldn-ldm))  // (mod_type)
            w := 1  // (mod_type)
            for j:=0 to mh-1
            {
                for r:=0 to n-m step m
                {
                    t1 := r + j
                    t2 := t1 + mh
                    v := f[t2] * w  // (mod_type)
                    u := f[t1]      // (mod_type)
                    f[t1] := u + v
                    f[t2] := u - v
                }
                w := w * dw  // trig recursion
            }
        }
    }


As shown in section 21.2.1 on page 412 it is a good idea to extract the ldm==1 stage of the outermost
loop: Replace

    for ldm:=1 to ldn
    {

by

    for r:=0 to n-1 step 2
    {
        {f[r], f[r+1]} := {f[r]+f[r+1], f[r]-f[r+1]}  // parallel assignment
    }
    for ldm:=2 to ldn
    {

The C++ implementation is given in [FXT: ntt/nttdit2.cc]:


    void
    ntt_dit2_core(mod *f, ulong ldn, int is)
    // Auxiliary routine for ntt_dit2()
    // Decimation in time (DIT) radix-2 FFT
    // Input data must be in revbin_permuted order
    // ldn := base-2 logarithm of the array length
    // is := sign of the transform
    {
        const ulong n = 1UL<<ldn;

        for (ulong i=0; i<n; i+=2)  sumdiff(f[i], f[i+1]);

        for (ulong ldm=2; ldm<=ldn; ++ldm)
        {
            const ulong m = (1UL<<ldm);
            const ulong mh = (m>>1);

            const mod dw = mod::root2pow( is>0 ? ldm : -ldm );
            mod w = (mod::one);

            for (ulong j=0; j<mh; ++j)
            {
                for (ulong r=0; r<n; r+=m)
                {
                    const ulong t1 = r + j;
                    const ulong t2 = t1 + mh;

                    mod v = f[t2] * w;
                    mod u = f[t1];

                    f[t1] = u + v;
                    f[t2] = u - v;
                }
                w *= dw;
            }
        }
    }

    void
    ntt_dit2(mod *f, ulong ldn, int is)
    // Radix-2 decimation in time (DIT) NTT
    {
        revbin_permute(f, 1UL<<ldn);
        ntt_dit2_core(f, ldn, is);
    }


The elements of order 2^k are precomputed at initialization of the mod class. The call to mod::root2pow()
is a simple table lookup.
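
For experiments without the mod class, the radix-2 DIT algorithm can be written with plain 64-bit integers. The sketch below is an assumption-laden stand-in, not the FXT code: it fixes the modulus P = 3·5·2^27 + 1 = 2013265921, takes G = 31 as a primitive root modulo P (a commonly quoted value, assumed here), computes the roots by powering instead of a table lookup, and leaves the inverse transform unnormalized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::uint64_t u64;
static const u64 P = 2013265921ULL;  // 3*5*2^27 + 1 (prime), P < 2^31
static const u64 G = 31;             // assumed primitive root modulo P

u64 mul(u64 a, u64 b) { return (a * b) % P; }  // a, b < P, no overflow
u64 power(u64 b, u64 e)
{
    u64 r = 1;
    for (; e; e >>= 1, b = mul(b, b))  if (e & 1)  r = mul(r, b);
    return r;
}

void revbin_permute(std::vector<u64> &f)
{
    const std::size_t n = f.size();
    for (std::size_t x = 0, r = 0; x < n; ++x)
    {
        if (r > x)  std::swap(f[x], f[r]);
        // increment r as a bit-reversed counter:
        std::size_t b = n >> 1;
        while (b && (r & b)) { r ^= b;  b >>= 1; }
        r |= b;
    }
}

// Radix-2 DIT NTT of length 2^ldn (ldn <= 27), sign is = +1 or -1.
// No normalization: forward followed by inverse multiplies by n.
void ntt_dit2(std::vector<u64> &f, unsigned ldn, int is)
{
    const std::size_t n = std::size_t(1) << ldn;
    u64 rn = power(G, (P - 1) / n);      // element of order n
    if (is < 0)  rn = power(rn, P - 2);  // inverse via Fermat's theorem
    revbin_permute(f);
    for (unsigned ldm = 1; ldm <= ldn; ++ldm)
    {
        const std::size_t m = std::size_t(1) << ldm, mh = m >> 1;
        const u64 dw = power(rn, u64(1) << (ldn - ldm));
        u64 w = 1;
        for (std::size_t j = 0; j < mh; ++j)
        {
            for (std::size_t r = 0; r < n; r += m)
            {
                const u64 u = f[r + j], v = mul(f[r + j + mh], w);
                f[r + j] = (u + v) % P;
                f[r + j + mh] = (u + P - v) % P;
            }
            w = mul(w, dw);  // trig recursion
        }
    }
}
```

Transforming a length-8 array forward and then backward returns every element multiplied by 8, which is the normalization factor that the convolution routines divide out at the end.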


26.2.2 Radix-2 DIF NTT 


Pseudocode for the radix-2 decimation in frequency (DIF) NTT: 


    procedure mod_fft_dif2(f[], ldn, is)
    // mod_type f[0..2**ldn-1]
    {
        n := 2**ldn
        dw := element_of_order(n)  // (mod_type)
        if is<0 then dw := dw**(-1)
        for ldm:=ldn to 1 step -1
        {
            m := 2**ldm
            mh := m/2
            w := 1  // (mod_type)
            for j:=0 to mh-1
            {
                for r:=0 to n-m step m
                {
                    t1 := r + j
                    t2 := t1 + mh
                    v := f[t2]  // (mod_type)
                    u := f[t1]  // (mod_type)
                    f[t1] := u + v
                    f[t2] := (u - v) * w
                }
                w := w * dw  // trig recursion
            }
            dw := dw * dw
        }
        revbin_permute(f[], n)
    }


As in section 21.2.2 on page 414, extract the ldm==1 stage of the outermost loop: replace the line

    for ldm:=ldn to 1 step -1

by

    for ldm:=ldn to 2 step -1

and insert

    for r:=0 to n-1 step 2
    {
        {f[r], f[r+1]} := {f[r]+f[r+1], f[r]-f[r+1]}  // parallel assignment
    }

before the call of revbin_permute(f[], n).

The C++ implementation is given in [FXT: ntt/nttdif2.cc]:


    void
    ntt_dif2_core(mod *f, ulong ldn, int is)
    // Auxiliary routine for ntt_dif2().
    // Decimation in frequency (DIF) radix-2 NTT.
    // Output data is in revbin_permuted order.
    // ldn := base-2 logarithm of the array length.
    // is := sign of the transform
    {
        const ulong n = (1UL<<ldn);

        mod dw = mod::root2pow( is>0 ? ldn : -ldn );
        for (ulong ldm=ldn; ldm>1; --ldm)
        {
            const ulong m = (1UL<<ldm);
            const ulong mh = (m>>1);
            mod w = mod::one;
            for (ulong j=0; j<mh; ++j)
            {
                for (ulong r=0; r<n; r+=m)
                {
                    const ulong t1 = r + j;
                    const ulong t2 = t1 + mh;
                    mod v = f[t2];
                    mod u = f[t1];
                    f[t1] = (u + v);
                    f[t2] = (u - v) * w;
                }
                w *= dw;
            }
            dw *= dw;
        }

        for (ulong i=0; i<n; i+=2)  sumdiff(f[i], f[i+1]);
    }

    void
    ntt_dif2(mod *f, ulong ldn, int is)
    // Radix-2 decimation in frequency (DIF) NTT
    {
        ntt_dif2_core(f, ldn, is);
        revbin_permute(f, 1UL<<ldn);
    }

26.2.3 Radix-4 NTTs 


The radix-4 versions of the NTT are straightforward translations of the routines that use complex numbers.
We simply give the C++ implementations.


26.2.3.1 Decimation in time (DIT) algorithm


Code for a radix-4 decimation in time (DIT) NTT [FXT: ntt/nttdit4.cc]:


    static const ulong LX = 2;

    void
    ntt_dit4_core(mod *f, ulong ldn, int is)
    // Auxiliary routine for ntt_dit4()
    // Decimation in time (DIT) radix-4 NTT
    // Input data must be in revbin_permuted order
    // ldn := base-2 logarithm of the array length
    // is := sign of the transform
    {
        const ulong n = (1UL<<ldn);

        if ( ldn & 1 )  // n is not a power of 4, need a radix-2 step
        {
            for (ulong i=0; i<n; i+=2)  sumdiff(f[i], f[i+1]);
        }

        const mod imag = mod::root2pow( is>0 ? 2 : -2 );

        ulong ldm = LX + (ldn&1);
        for ( ; ldm<=ldn ; ldm+=LX)
        {
            const ulong m = (1UL<<ldm);
            const ulong m4 = (m>>LX);
            const mod dw = mod::root2pow( is>0 ? ldm : -ldm );

            mod w = (mod::one);
            mod w2 = w;
            mod w3 = w;

            for (ulong j=0; j<m4; j++)
            {
                for (ulong r=0, i0=j+r;  r<n;  r+=m, i0+=m)
                {
                    const ulong i1 = i0 + m4;
                    const ulong i2 = i1 + m4;
                    const ulong i3 = i2 + m4;

                    mod a0 = f[i0];
                    mod a2 = f[i1] * w2;
                    mod a1 = f[i2] * w;
                    mod a3 = f[i3] * w3;

                    mod t02 = a0 + a2;
                    mod t13 = a1 + a3;

                    f[i0] = t02 + t13;
                    f[i2] = t02 - t13;

                    t02 = a0 - a2;
                    t13 = a1 - a3;
                    t13 *= imag;

                    f[i1] = t02 + t13;
                    f[i3] = t02 - t13;
                }

                w *= dw;
                w2 = w * w;
                w3 = w * w2;
            }
        }
    }

    void
    ntt_dit4(mod *f, ulong ldn, int is)
    // Radix-4 decimation in time (DIT) NTT
    {
        revbin_permute(f, 1UL<<ldn);
        ntt_dit4_core(f, ldn, is);
    }


26.2.3.2 Decimation in frequency (DIF) algorithm 


Code for a radix-4 decimation in frequency (DIF) NTT [FXT: ntt/nttdif4.cc]:

    static const ulong LX = 2;

    void
    ntt_dif4_core(mod *f, ulong ldn, int is)
    // Auxiliary routine for ntt_dif4().
    // Decimation in frequency (DIF) radix-4 NTT.
    // Output data is in revbin_permuted order.
    // ldn := base-2 logarithm of the array length.
    // is := sign of the transform
    {
        const ulong n = (1UL<<ldn);

        const mod imag = mod::root2pow( is>0 ? 2 : -2 );

        for (ulong ldm=ldn; ldm>=LX; ldm-=LX)
        {
            const ulong m = (1UL<<ldm);
            const ulong m4 = (m>>LX);

            const mod dw = mod::root2pow( is>0 ? ldm : -ldm );
            mod w = (mod::one);
            mod w2 = w;
            mod w3 = w;

            for (ulong j=0; j<m4; j++)
            {
                for (ulong r=0, i0=j+r;  r<n;  r+=m, i0+=m)
                {
                    const ulong i1 = i0 + m4;
                    const ulong i2 = i1 + m4;
                    const ulong i3 = i2 + m4;

                    mod a0 = f[i0];
                    mod a1 = f[i1];
                    mod a2 = f[i2];
                    mod a3 = f[i3];

                    mod t02 = a0 + a2;
                    mod t13 = a1 + a3;

                    f[i0] = (t02 + t13);
                    f[i1] = (t02 - t13) * w2;

                    t02 = a0 - a2;
                    t13 = a1 - a3;
                    t13 *= imag;

                    f[i2] = (t02 + t13) * w;
                    f[i3] = (t02 - t13) * w3;
                }

                w *= dw;
                w2 = w * w;
                w3 = w * w2;
            }
        }

        if ( ldn & 1 )  // n is not a power of 4, need a radix-2 step
        {
            for (ulong i=0; i<n; i+=2)  sumdiff(f[i], f[i+1]);
        }
    }

    void
    ntt_dif4(mod *f, ulong ldn, int is)
    // Radix-4 decimation in frequency (DIF) NTT
    {
        ntt_dif4_core(f, ldn, is);
        revbin_permute(f, 1UL<<ldn);
    }


26.3 Convolution with NTTs 


The NTTs are natural candidates for the computation of exact integer convolutions, as used in high
precision multiplication algorithms. All computations are modulo m, the largest value that can be
represented is m − 1. Choosing a modulus that is greater than the maximal possible value of the result
avoids any truncation.

If m does not fit into a single machine word, the modular arithmetic tends to be expensive. This may slow
down the computation unacceptably. It is better to choose m as a product of mutually coprime moduli $m_i$
that are all just below machine word size, compute the convolutions for each modulus $m_i$, and finally use
the Chinese Remainder Theorem (see section 39.4 on page 772) to obtain the result modulo m. In [271]
it is suggested to use three primes just below the word size. This method allows computing convolutions
(almost) up to lengths that just fit into a machine word.
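
The recombination step can be illustrated with two moduli instead of three. The sketch below assumes the two primes P1 = 3·5·2^27 + 1 and P2 = 13·2^28 + 1 mentioned in section 26.1, whose product still fits into 64 bits; the function names are made up for this example, and a production routine with three word-sized primes would need multi-word arithmetic.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;
static const u64 P1 = 2013265921ULL;  // 3*5*2^27 + 1
static const u64 P2 = 3489660929ULL;  // 13*2^28 + 1

// b^e mod m for m < 2^32 (products fit into 64 bits).
u64 pow_mod(u64 b, u64 e, u64 m)
{
    u64 r = 1;
    b %= m;
    for (; e; e >>= 1, b = (b * b) % m)  if (e & 1)  r = (r * b) % m;
    return r;
}

// Given r1 = x mod P1 and r2 = x mod P2, return the unique x with
// 0 <= x < P1*P2 (Chinese Remainder Theorem, two moduli).
u64 crt2(u64 r1, u64 r2)
{
    const u64 inv = pow_mod(P1 % P2, P2 - 2, P2);  // P1^(-1) mod P2 (Fermat)
    const u64 d = (r2 + P2 - r1 % P2) % P2;        // (r2 - r1) mod P2
    const u64 t = (d * inv) % P2;                  // t < P2 < 2^32
    return r1 + P1 * t;                            // < P1*P2 < 2^63
}
```

Each convolution element is computed once modulo P1 and once modulo P2; crt2() then recovers the exact value as long as it is smaller than P1·P2, which is about 7.0·10^18.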


Routines for the NTT-based exact convolution are given in [FXT: ntt/nttcnvl.cc]. The routines are
virtually identical to their complex equivalents given on page 441. For example, the
routine for cyclic self-convolution is


    void
    ntt_auto_convolution(mod *f, ulong ldn)
    // Cyclic self-convolution.
    // Use zero padded data for linear convolution.
    {
        assert_two_invertible();  // so we can normalize later
        const int is = +1;
        ntt_dif4_core(f, ldn, is);  // transform
        const ulong n = (1UL<<ldn);
        for (ulong i=0; i<n; ++i)  f[i] *= f[i];  // multiply element-wise
        ntt_dit4_core(f, ldn, -is);  // inverse transform
        multiply_val(f, n, (mod(n)).inv() );  // normalize
    }

The revbin permutations are avoided as explained in section 22.1.3 on page 442.


For further applications of the NTT see the survey article [172] and the references given there. 
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Fast wavelet transforms 


The discrete wavelet transforms are a class of transforms that can be computed in linear time. We describe 
wavelet transforms whose basis functions have compact support. These are derived as a generalization 
of the Haar transform. 


27.1 Wavelet filters 


We motivate the wavelet transform as a generalization of the 'standard' Haar transform given in
section 24.1 on page 497. The Haar transform will be reformulated as a sequence of filtering steps.

We consider only (moving average) filters F defined by n coefficients (filter taps) $f_0, f_1, \ldots, f_{n-1}$. Let
A be the length-N sequence $a_0, a_1, \ldots, a_{N-1}$. Define $F_k(A)$ as the weighted sum

    $F_k(A) := \sum_{j=0}^{n-1} f_j\, a_{(k+j) \bmod N}$                   (27.1-1)

That is, $F_k(A)$ is the result of applying the filter F to the n elements $a_k, a_{k+1}, a_{k+2}, \ldots, a_{k+n-1}$, possibly
wrapping around.

Now assume that N is a power of 2. Let H be the low-pass filter defined by $h_0 = h_1 = +1/\sqrt{2}$ and G be
the high-pass filter defined by $g_0 = +1/\sqrt{2}$, $g_1 = -1/\sqrt{2}$. A single filtering step of the Haar transform
consists of

• computing the sums: $s_0 = H_0(A)$, $s_2 = H_2(A)$, $s_4 = H_4(A)$, ..., $s_{N-2} = H_{N-2}(A)$,
• computing the differences: $d_0 = G_0(A)$, $d_2 = G_2(A)$, $d_4 = G_4(A)$, ..., $d_{N-2} = G_{N-2}(A)$,
• writing the sums to the left half of A and the differences to the right half:
  $A = [s_0, s_2, s_4, s_6, \ldots, s_{N-2},\ d_0, d_2, d_4, d_6, \ldots, d_{N-2}]$.


The Haar transform is computed by applying the filtering step to the whole sequence, then to its left 
half, then to its left quarter, ..., the left four elements, the left two elements. With the Haar transform 
no wrap-around occurs. 
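
The filtering formulation of the Haar transform can be sketched directly from the three steps above. The function names haar_step() and haar() are chosen for this example and are not the names of the library routines.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One Haar filtering step on the first n elements of a (n even):
// sums go to the left half, differences to the right half.
void haar_step(std::vector<double> &a, std::size_t n)
{
    const double s2 = std::sqrt(0.5);  // h0 = h1 = g0 = 1/sqrt(2), g1 = -1/sqrt(2)
    std::vector<double> t(n);
    for (std::size_t i = 0, j = 0; i < n; i += 2, ++j)
    {
        t[j]         = (a[i] + a[i + 1]) * s2;  // s_i = H_i(A)
        t[n / 2 + j] = (a[i] - a[i + 1]) * s2;  // d_i = G_i(A)
    }
    for (std::size_t k = 0; k < n; ++k)  a[k] = t[k];
}

// Haar transform: apply the step to the whole sequence, then to its
// left half, left quarter, ..., down to the left two elements.
// The length of a must be a power of 2.
void haar(std::vector<double> &a)
{
    for (std::size_t n = a.size(); n >= 2; n >>= 1)  haar_step(a, n);
}
```

For the constant sequence [1, 1, 1, 1] the first step gives [sqrt(2), sqrt(2), 0, 0] and the second step gives [2, 0, 0, 0]: all differences vanish and the single remaining sum carries the whole signal.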


The analogous filtering step for the wavelet transform is obtained by defining two length-n filters H 
(low-pass) and G (high-pass) subject to certain conditions. We consider only filters with an even number 
n of coefficients. 


Define the coefficients of G to be the reversed sequence of the coefficients of H with alternating signs:

    $g_0 = +h_{n-1},\ g_1 = -h_{n-2},\ g_2 = +h_{n-3},\ g_3 = -h_{n-4},\ \ldots,$
    $\ldots,\ g_{n-3} = -h_2,\ g_{n-2} = +h_1,\ g_{n-1} = -h_0$            (27.1-2)

We also require that the resulting transform is orthogonal. Let S be the matrix corresponding to one
filtering step, ignoring the order:

    $S\,A = [s_0, d_0, s_2, d_2, s_4, d_4, s_6, d_6, \ldots, s_{N-2}, d_{N-2}]$    (27.1-3)
For example, with length-6 filters and N = 16 the matrix S would be

          h0 h1 h2 h3 h4 h5  0  0  0  0  0  0  0  0  0  0
          g0 g1 g2 g3 g4 g5  0  0  0  0  0  0  0  0  0  0
           0  0 h0 h1 h2 h3 h4 h5  0  0  0  0  0  0  0  0
           0  0 g0 g1 g2 g3 g4 g5  0  0  0  0  0  0  0  0
           0  0  0  0 h0 h1 h2 h3 h4 h5  0  0  0  0  0  0
           0  0  0  0 g0 g1 g2 g3 g4 g5  0  0  0  0  0  0
           0  0  0  0  0  0 h0 h1 h2 h3 h4 h5  0  0  0  0
    S  =   0  0  0  0  0  0 g0 g1 g2 g3 g4 g5  0  0  0  0          (27.1-4a)
           0  0  0  0  0  0  0  0 h0 h1 h2 h3 h4 h5  0  0
           0  0  0  0  0  0  0  0 g0 g1 g2 g3 g4 g5  0  0
           0  0  0  0  0  0  0  0  0  0 h0 h1 h2 h3 h4 h5
           0  0  0  0  0  0  0  0  0  0 g0 g1 g2 g3 g4 g5
          h4 h5  0  0  0  0  0  0  0  0  0  0 h0 h1 h2 h3
          g4 g5  0  0  0  0  0  0  0  0  0  0 g0 g1 g2 g3
          h2 h3 h4 h5  0  0  0  0  0  0  0  0  0  0 h0 h1
          g2 g3 g4 g5  0  0  0  0  0  0  0  0  0  0 g0 g1

Using relation 27.1-2 we have:

          +h0 +h1 +h2 +h3 +h4 +h5   0   0   0   0   0   0   0   0   0   0
          +h5 -h4 +h3 -h2 +h1 -h0   0   0   0   0   0   0   0   0   0   0
            0   0 +h0 +h1 +h2 +h3 +h4 +h5   0   0   0   0   0   0   0   0
            0   0 +h5 -h4 +h3 -h2 +h1 -h0   0   0   0   0   0   0   0   0
            0   0   0   0 +h0 +h1 +h2 +h3 +h4 +h5   0   0   0   0   0   0
            0   0   0   0 +h5 -h4 +h3 -h2 +h1 -h0   0   0   0   0   0   0
            0   0   0   0   0   0 +h0 +h1 +h2 +h3 +h4 +h5   0   0   0   0
    S  =    0   0   0   0   0   0 +h5 -h4 +h3 -h2 +h1 -h0   0   0   0   0          (27.1-4b)
            0   0   0   0   0   0   0   0 +h0 +h1 +h2 +h3 +h4 +h5   0   0
            0   0   0   0   0   0   0   0 +h5 -h4 +h3 -h2 +h1 -h0   0   0
            0   0   0   0   0   0   0   0   0   0 +h0 +h1 +h2 +h3 +h4 +h5
            0   0   0   0   0   0   0   0   0   0 +h5 -h4 +h3 -h2 +h1 -h0
          +h4 +h5   0   0   0   0   0   0   0   0   0   0 +h0 +h1 +h2 +h3
          +h1 -h0   0   0   0   0   0   0   0   0   0   0 +h5 -h4 +h3 -h2
          +h2 +h3 +h4 +h5   0   0   0   0   0   0   0   0   0   0 +h0 +h1
          +h3 -h2 +h1 -h0   0   0   0   0   0   0   0   0   0   0 +h5 -h4

The orthogonality requires that $S\,S^T = \mathrm{id}$, that is (setting $h_j = 0$ for $j < 0$ and $j \geq n$)

    $\sum_j h_j^2 = 1$                                                     (27.1-5a)
    $\sum_j h_j\, h_{j+2} = 0$                                             (27.1-5b)
    $\sum_j h_j\, h_{j+4} = 0$                                             (27.1-5c)

In general, we have the following n/2 wavelet conditions:

    $\sum_j h_j^2 = 1$                                                     (27.1-6a)
    $\sum_j h_j\, h_{j+2i} = 0$  where  $i = 1, 2, 3, \ldots, n/2-1$       (27.1-6b)

We call a filter H satisfying these conditions a wavelet filter.



For the wavelet transform with n = 2 filter taps there is only the condition $h_0^2 + h_1^2 = 1$. It leads to the
parametric solution $h_0 = \sin(\phi)$, $h_1 = \cos(\phi)$. Setting $\phi = \pi/4$ we find $h_0 = h_1 = 1/\sqrt{2}$, corresponding
to the Haar transform.
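
The wavelet conditions are easy to test numerically for a given set of taps. The sketch below checks them for the 2-tap parametric solution and for the 4-tap Daubechies filter; the closed form of the latter is the well-known one and is an addition for illustration, it is not taken from this section. The names wavelet_sum(), is_wavelet_filter() and daubechies4() are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum over j of h[j]*h[j+2*i], with h[j] = 0 outside 0..n-1.
double wavelet_sum(const std::vector<double> &h, std::size_t i)
{
    double v = 0.0;
    for (std::size_t j = 0; j + 2 * i < h.size(); ++j)  v += h[j] * h[j + 2 * i];
    return v;
}

// True if h satisfies the n/2 wavelet conditions up to eps.
bool is_wavelet_filter(const std::vector<double> &h, double eps = 1e-12)
{
    if (std::fabs(wavelet_sum(h, 0) - 1.0) > eps)  return false;
    for (std::size_t i = 1; i < h.size() / 2; ++i)
        if (std::fabs(wavelet_sum(h, i)) > eps)  return false;
    return true;
}

// 4-tap Daubechies filter (well-known closed form, assumed here).
std::vector<double> daubechies4()
{
    const double s3 = std::sqrt(3.0), d = 4.0 * std::sqrt(2.0);
    return { (1 + s3) / d, (3 + s3) / d, (3 - s3) / d, (1 - s3) / d };
}
```

With these definitions is_wavelet_filter(daubechies4()) holds, as does is_wavelet_filter({sin(phi), cos(phi)}) for any angle phi, while an unnormalized filter such as {1, 1} is rejected.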


27.2 Implementation 


A container class for wavelet filters is [FXT: class wavelet_filter in wavelet/waveletfilter.h]:

    class wavelet_filter
    {
    public:
        double *h_;  // low-pass filter
        double *g_;  // high-pass filter
        ulong n_;    // number of taps

        void ctor_core()
        {
            h_ = new double[n_];
            g_ = new double[n_];
        }

        wavelet_filter(const double *w, ulong n=0)
        {
            if ( 0!=n )  n_ = n;
            else  // zero terminated array w[]
            {
                n_ = 0;
                while ( w[n_]!=0 )  ++n_;
            }

            ctor_core();

            for (ulong i=0, j=n_-1;  i<n_;  ++i, --j)
            {
                h_[i] = w[i];
                if ( !(i&1) )  g_[j] = -h_[i];  // even indices
                else           g_[j] = +h_[i];  // odd indices
            }
        }

      [--snip--]


The wavelet conditions can be checked via

    bool check(double eps=1e-6)  const
    {
        if ( fabs(norm_sqr(0)-1.0) > eps )  return false;

        for (ulong i=1; i<n_/2; ++i)
            if ( fabs(norm_sqr(i)) > eps )  return false;

        return true;
    }

where norm_sqr() computes the sums in the relations 27.1-6a and 27.1-6b:

    static double norm_sqr(const double *h, ulong n, ulong s=0)
    {
        s *= 2;  // Note!
        if ( s>=n )  return 0.0;
        double v = 0;
        for (ulong k=0, j=s;  j<n;  ++k, ++j)  v += (h[k]*h[j]);
        return v;
    }

    double norm_sqr(ulong s=0)  const  { return norm_sqr(h_, n_, s); }


A wavelet step can be implemented as [FXT: wavelet/wavelet.cc]:

    void
    wavelet_step(double *f, ulong n, const wavelet_filter &wf, double *t)
    {
        const ulong nh = (n>>1);
        const ulong m = n-1;  // mask to compute modulo n (n is a power of 2)
        for (ulong i=0, j=0;  i<n;  i+=2, ++j)  // i in [0,2,4,..,n-2];  j in [0,1,2,..,n/2-1]
        {
            double s = 0.0, d = 0.0;
            for (ulong k=0; k<wf.n_; ++k)
            {
                ulong w = (i+k) & m;
                s += (wf.h_[k] * f[w]);
                d += (wf.g_[k] * f[w]);
            }
            t[j] = s;     // low-pass output, lower half
            t[nh+j] = d;  // high-pass output, upper half
        }
        acopy(t, f, n);  // f[] := t[]
    }


The wavelet transform itself is

    void
    wavelet(double *f, ulong ldn, const wavelet_filter &wf, ulong minm/*=2*/)
    {
        ulong n = (1UL<<ldn);
        ALLOCA(double, t, n);
        for (ulong m=n; m>=minm; m>>=1)  wavelet_step(f, m, wf, t);
    }

The step for the inverse transform is [FXT: wavelet/invwavelet.cc]:

    void
    inverse_wavelet_step(double *f, ulong n, const wavelet_filter &wf, double *t)
    {
        const ulong nh = (n>>1);
        const ulong m = n-1;  // mask to compute modulo n (n is a power of 2)
        null(t, n);  // t[] := [0,0,...,0]
        for (ulong i=0, j=0;  i<n;  i+=2, ++j)
        {
            const double x = f[j],  y = f[nh+j];
            for (ulong k=0; k<wf.n_; ++k)
            {
                ulong w = (i+k) & m;
                t[w] += (wf.h_[k] * x);
                t[w] += (wf.g_[k] * y);
            }
        }
        acopy(t, f, n);  // f[] := t[]
    }

The inverse transform is

    void
    inverse_wavelet(double *f, ulong ldn, const wavelet_filter &wf, ulong minm/*=2*/)
    {
        ulong n = (1UL<<ldn);
        ALLOCA(double, t, n);
        for (ulong m=minm; m<=n; m<<=1)  inverse_wavelet_step(f, m, wf, t);
    }
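The forward and inverse steps can be tried out without the FXT library. The following is a minimal sketch, not the FXT code: it hard-codes the 4-tap filter (the values of Daub2[] rounded to double precision), uses std::vector instead of raw arrays, and the names step(), inverse_step() and roundtrip_error() are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 4-tap low-pass filter (Daub2[] from the text, rounded to double precision).
static const double H[4] = { +4.829629131445341e-01, +8.365163037378079e-01,
                             +2.241438680420134e-01, -1.294095225512604e-01 };

// High-pass tap k, derived from H[] as in the wavelet_filter constructor:
// g[n-1-i] = -h[i] for even i and +h[i] for odd i (here n=4).
static double G(std::size_t k)  { return (k & 1) ? -H[3-k] : +H[3-k]; }

// Forward step on the first n elements of f (n must be a power of 2).
void step(std::vector<double> &f, std::size_t n)
{
    assert( n>=2 && (n & (n-1))==0 && n<=f.size() );
    const std::size_t nh = n/2, m = n-1;  // m: mask for indices modulo n
    std::vector<double> t(n);
    for (std::size_t i=0, j=0;  i<n;  i+=2, ++j)
    {
        double s = 0.0, d = 0.0;
        for (std::size_t k=0; k<4; ++k)
        {
            const double x = f[(i+k) & m];
            s += H[k] * x;
            d += G(k) * x;
        }
        t[j] = s;  t[nh+j] = d;
    }
    std::copy(t.begin(), t.end(), f.begin());
}

// Inverse step: applies the transposed matrix.
void inverse_step(std::vector<double> &f, std::size_t n)
{
    assert( n>=2 && (n & (n-1))==0 && n<=f.size() );
    const std::size_t nh = n/2, m = n-1;
    std::vector<double> t(n, 0.0);
    for (std::size_t i=0, j=0;  i<n;  i+=2, ++j)
    {
        const double x = f[j], y = f[nh+j];
        for (std::size_t k=0; k<4; ++k)  t[(i+k) & m] += H[k]*x + G(k)*y;
    }
    std::copy(t.begin(), t.end(), f.begin());
}

// Largest deviation after a full transform followed by the inverse transform.
double roundtrip_error(std::vector<double> f)
{
    const std::vector<double> orig = f;
    for (std::size_t m=f.size(); m>=2; m>>=1)  step(f, m);
    for (std::size_t m=2; m<=f.size(); m<<=1)  inverse_step(f, m);
    double e = 0.0;
    for (std::size_t i=0; i<f.size(); ++i)  e = std::max(e, std::fabs(f[i]-orig[i]));
    return e;
}
```

Because the matrix of each step is orthogonal, the inverse step is just the transposed operation, so the round trip should reproduce the input up to rounding errors.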


A readable source about wavelets is [357]. 


27.3 Moment conditions 


As the wavelet conditions do not uniquely define the wavelet filters, we can impose additional properties.
We require that the first n/2 moments vanish for an n-tap wavelet filter:

\[ \sum_j (-1)^j h_j = 0 \tag{27.3-1a} \]
\[ \sum_j (-1)^j j^k h_j = 0 \quad \text{where } k = 1, 2, 3, \ldots, n/2-1 \tag{27.3-1b} \]

One motivation for these moment conditions is that for reasonably smooth signals (for which a polynomial
approximation is good) the transform coefficients from the high-pass filter (the $d_i$) will be close to 0.
With compression schemes that simply discard transform coefficients with small values this is a desirable
property.

The class [FXT: class wavelet_filter in wavelet/waveletfilter.h] has a method to compute the moments
of the filter:

    static double moment(const double *h, ulong n, ulong x=0)
    {
        if ( 0==x )
        {
            double v = 0.0;
            for (ulong k=0; k<n; k+=2)  v += h[k];
            for (ulong k=1; k<n; k+=2)  v -= h[k];
            return v;
        }

        double dk;

        double ve = 0;
        dk = 2.0;
        for (ulong k=2; k<n; k+=2, dk+=2.0)  ve += (pow(dk,x) * h[k]);

        double vo = 0;
        dk = 1.0;
        for (ulong k=1; k<n; k+=2, dk+=2.0)  vo += (pow(dk,x) * h[k]);

        return ve - vo;
    }

    double moment(ulong x=0)  const  { return moment(h_, n_, x); }


Filter coefficients that satisfy the moment conditions are given in [FXT: wavelet/daubechies.cc]:

    extern const double Daub1[] = {
    +7.071067811865475244008443621048e-01,
    +7.071067811865475244008443621048e-01 };

    extern const double Daub2[] = {
    +4.829629131445341433748715998644e-01,
    +8.365163037378079055752937809168e-01,
    +2.241438680420133810259727622404e-01,
    -1.294095225512603811744494188120e-01 };

    extern const double Daub3[] = {
    +3.326705529500826159985115891390e-01,
    +8.068915093110925764944936040887e-01,
    +4.598775021184915700951519421476e-01,
    -1.350110200102545886963899066993e-01,
    -8.544127388202666169281916918177e-02,
    +3.522629188570953660274066471551e-02 };

    extern const double Daub4[] = {
    +2.303778133088965008632911830440e-01,
    +7.148465705529156470899219552739e-01,
    +6.308807679298589078817163383006e-01,
    -2.798376941685985421141374718007e-02,
    -1.870348117190930840795706727890e-01,
    +3.084138183556076362721936253495e-02,
    +3.288301166688519973540751354924e-02,
    -1.059740178506903210488320852402e-02 };

    [--snip--]

    extern const double Daub38[] = {...};

The names reflect the number n/2 of vanishing moments. Reversing or negating the sequence of filter
coefficients leads to trivial variants which also satisfy the moment conditions.
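As a quick plausibility check, the wavelet conditions (27.1-6a, 27.1-6b) and the moment conditions (27.3-1a, 27.3-1b) can be evaluated for a given coefficient list. The sketch below is independent of the FXT class; the function names and the use of std::vector are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum over k of h[k]*h[k+2*s]: left side of relation 27.1-6a (s=0) or 27.1-6b (s>0).
double norm_sqr(const std::vector<double> &h, std::size_t s)
{
    double v = 0.0;
    for (std::size_t k=0; k+2*s<h.size(); ++k)  v += h[k] * h[k+2*s];
    return v;
}

// Sum over k of (-1)^k k^x h[k]: left side of relation 27.3-1a (x=0) or 27.3-1b (x>0).
double moment(const std::vector<double> &h, unsigned x)
{
    double v = 0.0;
    for (std::size_t k=0; k<h.size(); ++k)
    {
        double p = 1.0;
        for (unsigned e=0; e<x; ++e)  p *= double(k);  // p = k^x, with 0^0 = 1
        v += (k & 1) ? -p*h[k] : +p*h[k];
    }
    return v;
}

// True if all n/2 wavelet conditions and all n/2 moment conditions hold within eps.
bool satisfies_conditions(const std::vector<double> &h, double eps)
{
    assert( h.size()>=2 && h.size()%2==0 );
    const std::size_t n2 = h.size()/2;
    if ( std::fabs(norm_sqr(h, 0) - 1.0) > eps )  return false;
    for (std::size_t s=1; s<n2; ++s)
        if ( std::fabs(norm_sqr(h, s)) > eps )  return false;
    for (unsigned x=0; x<n2; ++x)
        if ( std::fabs(moment(h, x)) > eps )  return false;
    return true;
}
```

For example, the four Daub2[] values (rounded to double precision) pass the test, and so does the reversed sequence, while the pair (0.6, 0.8) satisfies the wavelet condition but not the moment condition.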


For the filters of length n >= 6 there are solutions that are essentially different. For n = 6 there is one
complex solution besides Daub3[]:

    -0.09556007476957763 + 0.0508627772544*I
    +0.08121662052705924 + 0.1525883317632*I
    +0.72145023542906591 + 0.1017255545088*I
    +0.72145023542906591 - 0.1017255545088*I
    +0.08121662052705924 - 0.1525883317632*I
    -0.09556007476957763 - 0.0508627772544*I

For n = 8 there is, besides Daub4[], an additional real solution (left) and a complex one (right):

    -0.07576571478950221      +0.02152475910155493 + 0.018428 *I
    -0.02963552764600249      -0.06571356411493559 + 0.017679 *I
    +0.49761866763277498      -0.19397617446078878 - 0.131995 *I
    +0.80373875180513208      +0.24627664139071534 - 0.280171 *I
    +0.29785779560530605      +0.85723045931761476 - 0.092141 *I
    -0.09921954357663353      +0.59199318785735184 + 0.206458 *I
    -0.01260396726203130      +0.02232773722816661 + 0.205709 *I
    +0.03222310060405146      -0.06544948394658407 + 0.056034 *I

The number of solutions grows exponentially with n (the minimal polynomial of any tap value has degree 


2"). The filters given in [FXT: wavelet /daubechies.cc| are the filters for the Daubechies wavelets (some 


closed form expressions for the filter coefficients are given in [105]). 


Filter coefficients that satisfy the wavelet and the moment conditions can be found by a Newton iteration
for zeros of the function $F: \mathbb{R}^n \to \mathbb{R}^n$, $F(h) := w$ where $w_i = F_i(h) = F_i(h_0, h_1, \ldots, h_{n-1})$.
For example, with n = 6 the $F_i$ are defined by

    F[1]: h0^2 + h1^2 + h2^2 + h3^2 + h4^2 + h5^2 - 1
    F[2]: h2*h0 + h3*h1 + h4*h2 + h5*h3
    F[3]: h4*h0 + h5*h1
    F[4]: -h0 + h1 + -h2 + h3 + -h4 + h5
    F[5]: h1 + -2*h2 + 3*h3 + -4*h4 + 5*h5
    F[6]: h1 + -4*h2 + 9*h3 + -16*h4 + 25*h5

The derivative is given by the Jacobi matrix J. It has the components $J_{r,c} := \partial F_r / \partial h_c$.
For n = 6 its rows are

    J[1]= [2*h0, 2*h1, 2*h2, 2*h3, 2*h4, 2*h5]
    J[2]= [h2, h3, h0 + h4, h1 + h5, h2, h3]
    J[3]= [h4, h5, 0, 0, h0, h1]
    J[4]= [-1, 1, -1, 1, -1, 1]
    J[5]= [0, 1, -2, 3, -4, 5]
    J[6]= [0, 1, -4, 9, -16, 25]

Now iterate (the equivalent to Newton's iteration, $x_{k+1} := x_k - f(x_k)/f'(x_k)$)

\[ h_{k+1} = h_k - J^{-1}(h_k)\, F(h_k) \tag{27.3-2} \]

The computations have to be carried out with high precision to avoid loss of accuracy.
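The iteration 27.3-2 can be sketched for the smallest nontrivial case n = 4 (two wavelet conditions and two moment conditions). The code below is an illustration only: it works in double precision rather than with the high precision recommended above, solves the linear system by plain Gaussian elimination, and all names (eval_F, eval_J, solve, newton) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

const std::size_t N = 4;  // number of taps in this sketch

// F(h) for n=4: F[0], F[1] are the wavelet conditions, F[2], F[3] the moment conditions.
void eval_F(const double *h, double *F)
{
    F[0] = h[0]*h[0] + h[1]*h[1] + h[2]*h[2] + h[3]*h[3] - 1.0;
    F[1] = h[2]*h[0] + h[3]*h[1];
    F[2] = -h[0] + h[1] - h[2] + h[3];
    F[3] = h[1] - 2.0*h[2] + 3.0*h[3];
}

// Jacobi matrix J[r][c] = dF_r / dh_c.
void eval_J(const double *h, double J[N][N])
{
    for (std::size_t c=0; c<N; ++c)  J[0][c] = 2.0*h[c];
    J[1][0] = h[2];  J[1][1] = h[3];  J[1][2] = h[0];  J[1][3] = h[1];
    J[2][0] = -1.0;  J[2][1] = +1.0;  J[2][2] = -1.0;  J[2][3] = +1.0;
    J[3][0] =  0.0;  J[3][1] = +1.0;  J[3][2] = -2.0;  J[3][3] = +3.0;
}

// Solve J*x = b by Gaussian elimination with partial pivoting (J and b are overwritten).
void solve(double J[N][N], double *b, double *x)
{
    for (std::size_t c=0; c<N; ++c)
    {
        std::size_t p = c;
        for (std::size_t r=c+1; r<N; ++r)
            if ( std::fabs(J[r][c]) > std::fabs(J[p][c]) )  p = r;
        assert( J[p][c] != 0.0 );  // singular matrix: not handled in this sketch
        for (std::size_t k=0; k<N; ++k)  std::swap(J[c][k], J[p][k]);
        std::swap(b[c], b[p]);
        for (std::size_t r=c+1; r<N; ++r)
        {
            const double q = J[r][c] / J[c][c];
            for (std::size_t k=c; k<N; ++k)  J[r][k] -= q * J[c][k];
            b[r] -= q * b[c];
        }
    }
    for (std::size_t i=N; i-- > 0; )  // back substitution
    {
        double s = b[i];
        for (std::size_t k=i+1; k<N; ++k)  s -= J[i][k] * x[k];
        x[i] = s / J[i][i];
    }
}

// Do the given number of steps h := h - J^{-1}(h) F(h); return the largest |F_i| afterwards.
double newton(double *h, int steps)
{
    double F[N], J[N][N], d[N];
    for (int s=0; s<steps; ++s)
    {
        eval_F(h, F);
        eval_J(h, J);
        solve(J, F, d);
        for (std::size_t i=0; i<N; ++i)  h[i] -= d[i];
    }
    eval_F(h, F);
    double m = 0.0;
    for (std::size_t i=0; i<N; ++i)  m = std::max(m, std::fabs(F[i]));
    return m;
}
```

Started near the Daub2[] values, for instance at (0.5, 0.8, 0.2, -0.1), a handful of steps should reproduce those coefficients to double precision. Which solution is reached depends on the starting point, since the reversed and negated sequences are zeros of F as well.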


Part IV 


Fast arithmetic 
Chapter 28


Fast multiplication and 
exponentiation 


The usual scheme for the multiplication of two N-digit numbers involves $O(N^2)$ operations. We describe
multiplication algorithms that are asymptotically better than this: the Karatsuba algorithm, the Toom-Cook
algorithms, and multiplication via FFTs. In addition, the left-to-right and right-to-left schemes for
binary exponentiation are described.


28.1 Splitting schemes for multiplication


Ordinary multiplication is $O(N^2)$. Assuming the hidden constant equals 1, the computation of the
product of two million-digit numbers would require $\approx 10^{12}$ operations. On a machine that does 1
billion operations per second the multiplication would need 1000 seconds. The following schemes lead to
algorithms with superior asymptotics.


28.1.1  2-way splitting: the Karatsuba algorithm 
The following algorithm is due to A. Karatsuba and Y. Ofman [200]. 
Split the numbers A and B (assumed to have approximately the same length) into two pieces

\[ A = x\,a_1 + a_0, \qquad B = x\,b_1 + b_0 \tag{28.1-1} \]

where x is a power of the radix close to $\sqrt{A}$ (a number with half as many digits as A). The usual
multiplication scheme needs four multiplications with half precision for one multiplication with full precision:

\[ A B = a_0 \cdot b_0 + x\,(a_0 \cdot b_1 + b_0 \cdot a_1) + x^2\, a_1 \cdot b_1 \tag{28.1-2} \]

Only the multiplications $a_i \cdot b_j$ need to be considered. The multiplications by x, a power of the radix, are
only shifts. If we use the relation

\[ A B = (1 + x)\, a_0 \cdot b_0 + x\,(a_1 - a_0) \cdot (b_0 - b_1) + (x + x^2)\, a_1 \cdot b_1 \tag{28.1-3} \]

we need three multiplications with half precision for one multiplication with full precision. By applying
the scheme recursively until the numbers to multiply are of machine size, we obtain an algorithm which
is $O(N^{\log_2(3)}) \approx O(N^{1.585})$. An alternative form of the splitting scheme is

\[ A B = (1 - x)\, a_0 \cdot b_0 + x\,(a_1 + a_0) \cdot (b_0 + b_1) + (x^2 - x)\, a_1 \cdot b_1 \tag{28.1-4} \]

It must be noted that the partial products may produce a carry, so a little bit more than half precision
is needed with each splitting step. Also note that partial products in the middle of relation 28.1-3 can
be negative.
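Relation 28.1-3 can be sketched with machine integers standing in for multi-digit numbers. This is a toy version under stated assumptions: operands are decimal numbers of at most 9 digits held in 64-bit signed integers (so the product fits), the splitting point is the power of ten nearest to half the length, and the names ipow10() and karatsuba() are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Returns 10^k.
std::int64_t ipow10(int k)
{
    std::int64_t x = 1;
    while ( k-- > 0 )  x *= 10;
    return x;
}

// Product A*B via relation 28.1-3; d is the number of decimal digits of the operands.
// Operands may be negative because the middle term uses differences.
std::int64_t karatsuba(std::int64_t A, std::int64_t B, int d)
{
    assert( d>=1 && d<=9 );            // keeps all intermediate values inside 64 bits
    if ( d==1 )  return A * B;         // small operands: multiply directly
    const int k = d/2;                 // number of digits in the low parts
    const std::int64_t x = ipow10(k);  // A = a1*x + a0,  B = b1*x + b0
    const std::int64_t a1 = A/x, a0 = A%x;
    const std::int64_t b1 = B/x, b0 = B%x;
    const int dh = d - k;              // digits of the sub-products' operands
    const std::int64_t p0 = karatsuba(a0, b0, dh);
    const std::int64_t p1 = karatsuba(a1, b1, dh);
    const std::int64_t pm = karatsuba(a1-a0, b0-b1, dh);  // can be negative
    return (1+x)*p0 + x*pm + (x + x*x)*p1;
}
```

A call such as karatsuba(8231, 8231, 4) splits at x = 100 and uses three half-length products per level. A real implementation works on digit arrays, where the multiplications by x are shifts.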


For squaring use either of the following schemes

\[ A^2 = (1 + x)\, a_0^2 - x\,(a_1 - a_0)^2 + (x + x^2)\, a_1^2 \tag{28.1-5a} \]
\[ A^2 = (1 - x)\, a_0^2 + x\,(a_1 + a_0)^2 + (x^2 - x)\, a_1^2 \tag{28.1-5b} \]

We compute 8231^2 = 67749361 with the first relation (28.1-5a):

    8231^2 == (100*82+31)^2
           == (1+100)*31^2 - 100*(82-31)^2 + (100+100^2)*82^2
           == (1+100)*[961] - 100*[2601] + (100+100^2)*[6724]
           == 961 + 96100 - 260100 + 672400 + 67240000
           == 67749361

Assume that the hidden constant equals 2 as there is more bookkeeping overhead than with the usual
algorithm. Computing the product of two million-digit numbers would require $\approx 2 \cdot (10^6)^{1.585} \approx 6.47 \cdot 10^9$
operations, taking about 6.5 seconds on our computer.


The Karatsuba scheme for polynomial multiplication is given in section 40.2 on page 827.
28.1.2 3-way splitting 


A method that splits U and V into more than two pieces is called a Toom-Cook algorithm (the method 
is called Toom algorithm in [114], and Cook-Toom algorithm in [2]). 


28.1.2.1 Zimmermann's 3-way multiplication 


    A = a2*x^2 + a1*x + a0
    B = b2*x^2 + b1*x + b0
    S0 = a0 * b0
    S1 = (a2+a1+a0) * (b2+b1+b0)
    S2 = (4*a2+2*a1+a0) * (4*b2+2*b1+b0)
    S3 = (a2-a1+a0) * (b2-b1+b0)
    S4 = a2 * b2
    T1 = 2*S3 + S2
    T1 /= 3  \\ division by 3
    T1 += S0
    T1 /= 2
    T1 -= 2*S4
    T2 = (S1 + S3)/2
    S1 -= T1
    S2 = T2 - S0 - S4
    S3 = T1 - T2
    P = S4*x^4 + S3*x^3 + S2*x^2 + S1*x + S0
    P - A*B  \\ == zero

Figure 28.1-A: Implementation of Zimmermann's 3-way multiplication scheme in GP.


A good scheme for 3-way splitting is due to Paul Zimmermann. We compute the product C = A · B of
two numbers, A and B

\[ A = a_2 x^2 + a_1 x + a_0 \tag{28.1-6a} \]
\[ B = b_2 x^2 + b_1 x + b_0 \tag{28.1-6b} \]
\[ C = A \cdot B = c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0 \tag{28.1-6c} \]

by the following scheme (taken from [104]): set

\[ S_0 := a_0 \cdot b_0 \tag{28.1-6d} \]
\[ S_1 := (a_2 + a_1 + a_0) \cdot (b_2 + b_1 + b_0) \tag{28.1-6e} \]
\[ S_2 := (4 a_2 + 2 a_1 + a_0) \cdot (4 b_2 + 2 b_1 + b_0) \tag{28.1-6f} \]
\[ S_3 := (a_2 - a_1 + a_0) \cdot (b_2 - b_1 + b_0) \tag{28.1-6g} \]
\[ S_4 := a_2 \cdot b_2 \tag{28.1-6h} \]

This costs 5 multiplications of length N/3. We already have found $c_0 = S_0$ and $c_4 = S_4$. We determine
$c_1$, $c_2$, and $c_3$ by the following assignments (in the given order):

\[ T_1 := 2 S_3 + S_2 \quad (= 18 c_4 + 6 c_3 + 6 c_2 + 3 c_0) \tag{28.1-6i} \]
\[ T_1 := T_1 / 3 \quad (= 6 c_4 + 2 c_3 + 2 c_2 + c_0) \quad \text{exact division by 3} \tag{28.1-6j} \]
\[ T_1 := T_1 + S_0 \quad (= 6 c_4 + 2 c_3 + 2 c_2 + 2 c_0) \tag{28.1-6k} \]
\[ T_1 := T_1 / 2 \quad (= 3 c_4 + c_3 + c_2 + c_0) \tag{28.1-6l} \]
\[ T_1 := T_1 - 2 S_4 \quad (= c_4 + c_3 + c_2 + c_0) \tag{28.1-6m} \]
\[ T_2 := (S_1 + S_3)/2 \quad (= c_4 + c_2 + c_0) \tag{28.1-6n} \]
\[ S_1 := S_1 - T_1 \quad (= c_1) \quad \text{wrong in cited paper} \tag{28.1-6o} \]
\[ S_2 := T_2 - S_0 - S_4 \quad (= c_2) \tag{28.1-6p} \]
\[ S_3 := T_1 - T_2 \quad (= c_3) \tag{28.1-6q} \]

Now we have

\[ C = A \cdot B = S_4 x^4 + S_3 x^3 + S_2 x^2 + S_1 x + S_0 \tag{28.1-6r} \]

The complexity of recursive multiplication based on this splitting scheme is $N^{\log_3(5)} \approx N^{1.465}$. Assume
that the hidden constant again equals 2. Then the computation of the product of two million-digit
numbers would require $\approx 2 \cdot (10^6)^{1.465} \approx 1.23 \cdot 10^9$ operations, taking about 1.2 seconds on our computer.

Note the division by 3 in relation 28.1-6j. A division by a constant (that is not a power of 2) cannot be
avoided in n-way splitting schemes for multiplication for n >= 3. There are squaring schemes that do not
involve such divisions.
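The assignments 28.1-6d to 28.1-6q can be followed with small integers in place of the length-N/3 pieces. The sketch below treats the pieces as 64-bit integers and returns the five product coefficients; the function name toom3() and the use of std::array are assumptions for this illustration, and the assertions merely document that the divisions are exact.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

typedef std::int64_t i64;

// Coefficients c[0..4] of (a[2] x^2 + a[1] x + a[0]) * (b[2] x^2 + b[1] x + b[0]),
// computed with five multiplications as in relations 28.1-6d .. 28.1-6q.
std::array<i64,5> toom3(const std::array<i64,3> &a, const std::array<i64,3> &b)
{
    i64 S0 = a[0] * b[0];
    i64 S1 = (a[2]+a[1]+a[0]) * (b[2]+b[1]+b[0]);
    i64 S2 = (4*a[2]+2*a[1]+a[0]) * (4*b[2]+2*b[1]+b[0]);
    i64 S3 = (a[2]-a[1]+a[0]) * (b[2]-b[1]+b[0]);
    i64 S4 = a[2] * b[2];

    i64 T1 = 2*S3 + S2;
    assert( T1 % 3 == 0 );
    T1 /= 3;                           // exact division by 3
    T1 += S0;
    assert( T1 % 2 == 0 );
    T1 /= 2;
    T1 -= 2*S4;                        // now T1 = c4 + c3 + c2 + c0
    const i64 T2 = (S1 + S3) / 2;      // c4 + c2 + c0
    S1 -= T1;                          // c1
    S2 = T2 - S0 - S4;                 // c2
    S3 = T1 - T2;                      // c3

    std::array<i64,5> c = {{ S0, S1, S2, S3, S4 }};
    return c;
}
```

With a = (3, 1, 4) and b = (1, 5, 9), both given lowest coefficient first, the result should be (3, 16, 36, 29, 36), which agrees with the schoolbook product of the two quadratics.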


28.1.2.2 3-way multiplication by Bodrato and Zanoni 


    A = a2*x^2 + a1*x + a0
    B = b2*x^2 + b1*x + b0
    S0 = a0 * b0
    S1 = (a2+a1+a0) * (b2+b1+b0)
    S2 = (4*a2+2*a1+a0) * (4*b2+2*b1+b0)
    S3 = (a2-a1+a0) * (b2-b1+b0)
    S4 = a2 * b2
    S2 = (S2 - S3)/3  \\ division by 3
    S3 = (S1 - S3)/2
    S1 = S1 - S0
    S2 = (S2 - S1)/2
    S1 = S1 - S3 - S4
    S2 = S2 - 2*S4
    S3 = S3 - S2
    P = S4*x^4 + S2*x^3 + S1*x^2 + S3*x + S0
    P - A*B  \\ == zero

Figure 28.1-B: Implementation of the 3-way multiplication scheme of Bodrato and Zanoni.


An alternative algorithm for 3-way splitting is suggested in [60]: set up $S_0, S_1, \ldots, S_4$ as in relations
28.1-6d .. 28.1-6h, then compute, in the given order,

\[ S_2 := (S_2 - S_3)/3 \quad (= 5 c_4 + 3 c_3 + c_2 + c_1) \quad \text{exact division by 3} \tag{28.1-7a} \]
\[ S_3 := (S_1 - S_3)/2 \quad (= c_3 + c_1) \tag{28.1-7b} \]
\[ S_1 := S_1 - S_0 \quad (= c_4 + c_3 + c_2 + c_1) \tag{28.1-7c} \]
\[ S_2 := (S_2 - S_1)/2 \quad (= 2 c_4 + c_3) \tag{28.1-7d} \]
\[ S_1 := S_1 - S_3 - S_4 \quad (= c_2) \tag{28.1-7e} \]
\[ S_2 := S_2 - 2 S_4 \quad (= c_3) \tag{28.1-7f} \]
\[ S_3 := S_3 - S_2 \quad (= c_1) \tag{28.1-7g} \]

Now we have (note the order of the coefficients $S_i$)

\[ C = A \cdot B = S_4 x^4 + S_2 x^3 + S_1 x^2 + S_3 x + S_0 \tag{28.1-7h} \]

The scheme requires only one multiplication by 2, while Zimmermann's scheme involves two.


28.1.2.3 3-way squaring 


The following scheme is taken from [104]. To compute the square $C = A^2$ of a number A

\[ A = a_2 x^2 + a_1 x + a_0 \tag{28.1-8a} \]
\[ C = A^2 = S_4 x^4 + S_3 x^3 + S_2 x^2 + S_1 x + S_0 \tag{28.1-8b} \]

set

\[ S_0 := a_0^2 \tag{28.1-8c} \]
\[ S_1 := (a_2 + a_1 + a_0)^2 \tag{28.1-8d} \]
\[ S_2 := (a_2 - a_1 + a_0)^2 \tag{28.1-8e} \]
\[ S_3 := 2\, a_1 \cdot a_2 \tag{28.1-8f} \]
\[ S_4 := a_2^2 \tag{28.1-8g} \]

This costs four squarings and one multiplication of length N/3. The quantities $S_0$, $S_3$, and $S_4$ are already
correct. Determine $S_1$ and $S_2$ via

\[ T_1 := (S_1 + S_2)/2 \tag{28.1-8i} \]
\[ S_1 := S_1 - T_1 - S_3 \tag{28.1-8j} \]
\[ S_2 := T_1 - S_4 - S_0 \tag{28.1-8k} \]


28.1.3 4-way splitting 
28.1.3.1 4-way multiplication 


An elegant and clean scheme for 4-way splitting of a multiplication is given by Bodrato and Zanoni in [61].
A GP implementation is shown in figure 28.1-C. The algorithm is $O(n^{\log_4(7)}) \approx O(n^{1.403})$. In general, an
s-way splitting scheme will be $O(n^{f(s)})$ where $f(s) = \log_s(2s - 1)$.


28.1.3.2 4-way squaring 
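The following scheme is taken from [104].

\[ A = a_3 x^3 + a_2 x^2 + a_1 x + a_0 \tag{28.1-9a} \]
\[ C = A^2 = c_6 x^6 + c_5 x^5 + c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0 \tag{28.1-9b} \]

Set

\[ S_1 := a_0^2 \tag{28.1-9c} \]
\[ S_2 := 2\, a_0 \cdot a_1 \tag{28.1-9d} \]
\[ S_3 := (a_0 + a_1 - a_2 - a_3) \cdot (a_0 - a_1 - a_2 + a_3) \tag{28.1-9e} \]
\[ S_4 := (a_0 + a_1 + a_2 + a_3)^2 \tag{28.1-9f} \]
\[ S_5 := 2\,(a_0 - a_2) \cdot (a_1 - a_3) \tag{28.1-9g} \]
\[ S_6 := 2\, a_3 \cdot a_2 \tag{28.1-9h} \]
\[ S_7 := a_3^2 \tag{28.1-9i} \]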
    A = a3*x^3 + a2*x^2 + a1*x + a0
    B = b3*x^3 + b2*x^2 + b1*x + b0
    S1 = a3*b3
    S2 = (8*a3+4*a2+2*a1+a0)*(8*b3+4*b2+2*b1+b0)
    S3 = (+a3+a2+a1+a0)*(+b3+b2+b1+b0)
    S4 = (-a3+a2-a1+a0)*(-b3+b2-b1+b0)
    S5 = (+8*a0+4*a1+2*a2+a3)*(+8*b0+4*b1+2*b2+b3)
    S6 = (-8*a0+4*a1-2*a2+a3)*(-8*b0+4*b1-2*b2+b3)
    S7 = a0*b0
    S2 += S5
    S4 -= S3
    S6 -= S5
    S4 /= 2
    S5 -= S1
    S5 -= (64*S7)
    S3 += S4
    S5 *= 2
    S5 += S6
    S2 -= (65*S3)
    S3 -= S1
    S3 -= S7
    S4 = -S4
    S6 = -S6
    S2 += (45*S3)
    S5 -= (8*S3)
    S5 /= 24  \\ division by 24
    S6 -= S2
    S2 -= (16*S4)
    S2 /= 18  \\ division by 18
    S3 -= S5
    S4 -= S2
    S6 += (30*S2)
    S6 /= 60  \\ division by 60
    S2 -= S6
    P = S1*x^6 + S2*x^5 + S3*x^4 + S4*x^3 + S5*x^2 + S6*x + S7
    P - A*B  \\ == zero

Figure 28.1-C: Implementation of the 4-way multiplication scheme in GP.
    A = a3*x^3 + a2*x^2 + a1*x + a0
    S1 = a0^2
    S2 = 2*a0*a1
    S3 = (a0 + a1 - a2 - a3)*(a0 - a1 - a2 + a3)
    S4 = (a0 + a1 + a2 + a3)^2
    S5 = 2*(a0 - a2)*(a1 - a3)
    S6 = 2*a3*a2
    S7 = a3^2
    T1 = S3 + S4
    T2 = (T1 + S5)/2
    T3 = S2 + S6
    T4 = T2 - T3
    T5 = T3 - S5
    T6 = T4 - S3
    T7 = T4 - S1
    T8 = T6 - S7
    P = S7*x^6 + S6*x^5 + T7*x^4 + T5*x^3 + T8*x^2 + S2*x + S1
    P - A^2  \\ == zero

Figure 28.1-D: Implementation of the 4-way squaring scheme in GP.


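Then set, in the given order,

\[ T_1 := S_3 + S_4 \tag{28.1-9j} \]
\[ T_2 := (T_1 + S_5)/2 \tag{28.1-9k} \]
\[ T_3 := S_2 + S_6 \tag{28.1-9l} \]
\[ T_4 := T_2 - T_3 \tag{28.1-9m} \]
\[ T_5 := T_3 - S_5 \tag{28.1-9n} \]
\[ T_6 := T_4 - S_3 \tag{28.1-9o} \]
\[ T_7 := T_4 - S_1 \tag{28.1-9p} \]
\[ T_8 := T_6 - S_7 \tag{28.1-9q} \]

The square then equals

\[ C = S_7 x^6 + S_6 x^5 + T_7 x^4 + T_5 x^3 + T_8 x^2 + S_2 x + S_1 \tag{28.1-9r} \]


28.1.4 5-way splitting


28.1.4.1 5-way multiplication


The scheme for 5-way splitting of a multiplication shown in figure 28.1-E is given in [61]. As with the
4-way multiplication scheme, no temporaries are used.


28.1.4.2 5-way squaring


We describe a 5-way squaring scheme given in [60]. Let

\[ A = a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 \tag{28.1-10a} \]
\[ C = A^2 = c_8 x^8 + c_7 x^7 + c_6 x^6 + c_5 x^5 + c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0 \tag{28.1-10b} \]

Set

\[ S_1 := a_0^2 \tag{28.1-11a} \]
\[ S_2 := a_4^2 \tag{28.1-11b} \]
\[ S_3 := (a_0 + a_1 + a_2 + a_3 + a_4)^2 \tag{28.1-11c} \]
\[ S_4 := (a_0 - a_1 + a_2 - a_3 + a_4)^2 \tag{28.1-11d} \]
\[ S_5 := 2\,(a_0 - a_2 + a_4) \cdot (a_1 - a_3) \tag{28.1-11e} \]
\[ S_6 := (a_0 + a_1 - a_2 - a_3 + a_4) \cdot (a_0 - a_1 - a_2 + a_3 + a_4) \tag{28.1-11f} \]
\[ S_7 := (a_1 + a_2 - a_4) \cdot (a_1 - a_2 - a_4 + 2\,(a_0 - a_3)) \tag{28.1-11g} \]
\[ S_8 := 2\, a_0 \cdot a_1 \tag{28.1-11h} \]
\[ S_9 := 2\, a_3 \cdot a_4 \tag{28.1-11i} \]

Then do the following assignments, in the order given:

\[ S_4 := (S_4 + S_3)/2 \quad (= c_0 + c_2 + c_4 + c_6 + c_8) \tag{28.1-12a} \]
\[ S_3 := S_3 - S_4 \quad (= c_1 + c_3 + c_5 + c_7) \tag{28.1-12b} \]
\[ S_6 := (S_6 + S_4)/2 \quad (= c_0 + c_4 + c_8) \tag{28.1-12c} \]
\[ S_5 := (-S_5 + S_3)/2 \quad (= c_3 + c_7) \tag{28.1-12d} \]
\[ S_4 := S_4 - S_6 \quad (= c_2 + c_6) \tag{28.1-12e} \]
\[ S_3 := S_3 - S_5 - S_8 \quad (= c_5) \tag{28.1-12f} \]
\[ S_6 := S_6 - S_2 - S_1 \quad (= c_4) \tag{28.1-12g} \]
\[ S_5 := S_5 - S_9 \quad (= c_3) \tag{28.1-12h} \]
\[ S_7 := S_7 - S_2 - S_8 - S_9 + S_6 + S_3 \quad (= c_2) \tag{28.1-12i} \]
\[ S_4 := S_4 - S_7 \quad (= c_6) \tag{28.1-12j} \]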
    A = a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
    B = b4*x^4 + b3*x^3 + b2*x^2 + b1*x + b0
    S1 = a4*b4
    S2 = (a0-2*a1+4*a2-8*a3+16*a4)*(b0-2*b1+4*b2-8*b3+16*b4)
    S5 = (a0+2*a1+4*a2+8*a3+16*a4)*(b0+2*b1+4*b2+8*b3+16*b4)
    S3 = (a4+2*a3+4*a2+8*a1+16*a0)*(b4+2*b3+4*b2+8*b1+16*b0)
    S8 = (a4-2*a3+4*a2-8*a1+16*a0)*(b4-2*b3+4*b2-8*b1+16*b0)
    S4 = (a0+4*a1+16*a2+64*a3+256*a4)*(b0+4*b1+16*b2+64*b3+256*b4)
    S6 = (a0-a1+a2-a3+a4)*(b0-b1+b2-b3+b4)
    S7 = (a0+a1+a2+a3+a4)*(b0+b1+b2+b3+b4)
    S9 = a0*b0
    S6 -= S7
    S2 -= S5
    S4 -= S9
    S4 -= (2^16*S1)
    S8 -= S3
    S6 /= 2
    S5 *= 2
    S5 += S2
    S2 = -S2
    S8 = -S8
    S7 += S6
    S6 = -S6
    S3 -= S7
    S5 -= (512*S7)
    S3 *= 2
    S3 -= S8
    S7 -= S1
    S7 -= S9
    S8 += S2
    S5 += S3
    S8 -= (80*S6)
    S3 -= (510*S9)
    S4 -= S2
    S3 *= 3
    S3 += S5
    S8 /= 180  \\ division by 180
    S5 += (378*S7)
    S2 /= 4
    S6 -= S2
    S5 /= (-72)  \\ division by -72
    S3 /= (-360)  \\ division by -360
    S2 -= S8
    S7 -= S3
    S4 -= (256*S5)
    S3 -= S5
    S4 -= (4096*S3)
    S4 -= (16*S7)
    S4 += (256*S6)
    S6 += S2
    S2 *= 180
    S2 += S4
    S2 /= 11340  \\ division by 11340
    S4 += (720*S6)
    S4 /= (-2160)  \\ division by -2160
    S6 -= S4
    S8 -= S2
    P = S1*x^8 + S2*x^7 + S3*x^6 + S4*x^5 + S5*x^4 + S6*x^3 + S7*x^2 + S8*x + S9
    P - A*B  \\ == zero

Figure 28.1-E: Implementation of the 5-way multiplication scheme in GP.
    A = a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
    S1 = a0^2
    S2 = a4^2
    S3 = (a0 + a1 + a2 + a3 + a4)^2
    S4 = (a0 - a1 + a2 - a3 + a4)^2
    S5 = 2*(a0-a2+a4)*(a1-a3)
    S6 = (a0 + a1 - a2 - a3 + a4)*(a0 - a1 - a2 + a3 + a4)
    S7 = (a1 + a2 - a4)*(a1 - a2 - a4 + 2*(a0-a3))
    S8 = 2*a0*a1
    S9 = 2*a3*a4
    S4 = (S4+S3)/2
    S3 = S3-S4
    S6 = (S6+S4)/2
    S5 = (-S5+S3)/2
    S4 = S4-S6
    S3 = S3-S5-S8
    S6 = S6-S2-S1
    S5 = S5-S9
    S7 = S7-S2-S8-S9+S6+S3
    S4 = S4-S7
    P = S2*x^8 + S9*x^7 + S4*x^6 + S3*x^5 + S6*x^4 + S5*x^3 + S7*x^2 + S8*x + S1
    P - A^2  \\ == zero

Figure 28.1-F: Implementation of the 5-way squaring scheme.
    A = a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
    S1 = a0^2
    S2 = a4^2
    [... S9, as before]
    T1 = S1 + 2*S2 - S7 + 2*S8 + S9
    T2 = S3 - S4
    T3 = 2*S5
    T4 = T2 + T3
    T5 = T2 - T3
    T6 = T4/4
    T7 = T5/4 - S9
    T8 = T1 - T6 - S6
    T9 = T6 - S8
    T10 = S3 + S6
    T11 = (T10 + S4 + S6)/4
    T12 = T11 - S1 - S2
    T13 = (T10 + S5)/2
    T14 = T13 - T1
    P = S2*x^8 + S9*x^7 + T8*x^6 + T9*x^5 + T12*x^4 + T7*x^3 + T14*x^2 + S8*x + S1
    P - A^2  \\ == zero

Figure 28.1-G: Implementation of the alternative 5-way squaring scheme in GP. Definition of S1, ..., S9
as in figure 28.1-F.


Now we have (note the order of the coefficients $S_i$)

\[ C = A^2 = S_2 x^8 + S_9 x^7 + S_4 x^6 + S_3 x^5 + S_6 x^4 + S_5 x^3 + S_7 x^2 + S_8 x + S_1 \tag{28.1-12k} \]


The following scheme is taken from [104], with some errors in the paper corrected. Set up $S_1, \ldots, S_9$ as
given by relations 28.1-11a .. 28.1-11i, then compute, in the given order,

\[ T_1 := S_1 + 2 S_2 - S_7 + 2 S_8 + S_9 \tag{28.1-13a} \]
\[ T_2 := S_3 - S_4 \tag{28.1-13b} \]
\[ T_3 := 2 S_5 \tag{28.1-13c} \]
\[ T_4 := T_2 + T_3 \tag{28.1-13d} \]
\[ T_5 := T_2 - T_3 \tag{28.1-13e} \]
\[ T_6 := T_4 / 4 \tag{28.1-13f} \]
\[ T_7 := T_5 / 4 - S_9 \tag{28.1-13g} \]
\[ T_8 := T_1 - T_6 - S_6 \tag{28.1-13h} \]
\[ T_9 := T_6 - S_8 \tag{28.1-13i} \]
\[ T_{10} := S_3 + S_6 \tag{28.1-13j} \]
\[ T_{11} := (T_{10} + S_4 + S_6)/4 \tag{28.1-13k} \]
\[ T_{12} := T_{11} - S_1 - S_2 \quad \text{(wrong in cited paper)} \tag{28.1-13l} \]
\[ T_{13} := (T_{10} + S_5)/2 \tag{28.1-13m} \]
\[ T_{14} := T_{13} - T_1 \tag{28.1-13n} \]

We have (note that two of the coefficients are wrong in the cited paper):

\[ C = S_2 x^8 + S_9 x^7 + T_8 x^6 + T_9 x^5 + T_{12} x^4 + T_7 x^3 + T_{14} x^2 + S_8 x + S_1 \tag{28.1-13o} \]


28.2 Fast multiplication via FFT 


We describe the FFT-based algorithm for multiplication of two numbers. To keep matters simple, we 
only consider numbers of the same length N. 
28.2.1 Numbers are almost polynomials 
An N-digit integer A written in radix R as

\[ a_{N-1}\; a_{N-2}\; \ldots\; a_2\; a_1\; a_0 \tag{28.2-1} \]

denotes a quantity of

\[ \sum_{i=0}^{N-1} a_i R^i = a_{N-1} R^{N-1} + a_{N-2} R^{N-2} + \ldots + a_1 R + a_0 \tag{28.2-2} \]

The digits can be identified with coefficients of a polynomial in R. For example, with decimal numbers
we have R = 10 and the number 578 equals $5 \cdot 10^2 + 7 \cdot 10^1 + 8 \cdot 10^0$. The product of two numbers is
almost the polynomial product

\[ \sum_{k=0}^{2N-2} c_k R^k := \sum_{i=0}^{N-1} a_i R^i \cdot \sum_{j=0}^{N-1} b_j R^j \tag{28.2-3} \]

As the $c_k$ can be greater than R − 1 (the nine for radix R), the result has to be fixed using carry operations:
go from right to left, replace $c_k$ by $c'_k = c_k \bmod R$ and add $(c_k - c'_k)/R$ to its left neighbor.


An example: usually we would multiply the numbers 82 and 34 as follows:

      82 x 34
          3 2 8
        2 4 6
      = 2 7 8 8

The carries can be delayed to the end of the computation:

      82 x 34
            32    8
      24     6
      24    38    8
      = 2 7 8 8

The computation before the carrying is a polynomial multiplication:

      (8x + 2)  x  (3x + 4)
                32x + 8
      24x^2  +   6x
    = 24x^2  +  38x + 8

The value of the polynomial 24x^2 + 38x + 8 for x = 10 is 2788.
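The two stages of the example, polynomial multiplication of the digit sequences followed by a single carry pass, can be written down directly. This is a small sketch with the quadratic-time convolution; digits are stored least significant first, and the names convolve() and carry() are chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint64_t> digits;  // least significant digit first

// Linear convolution c_k = sum_{i+j=k} a_i*b_j (no carries yet).
digits convolve(const digits &a, const digits &b)
{
    assert( !a.empty() && !b.empty() );
    digits c(a.size() + b.size() - 1, 0);
    for (std::size_t i=0; i<a.size(); ++i)
        for (std::size_t j=0; j<b.size(); ++j)
            c[i+j] += a[i] * b[j];
    return c;
}

// Carry pass for radix R: go from the lowest to the highest position,
// keep c_k mod R and add the quotient to the next position.
digits carry(digits c, std::uint64_t R)
{
    assert( R >= 2 );
    std::uint64_t cy = 0;
    for (std::size_t k=0; k<c.size(); ++k)
    {
        const std::uint64_t v = c[k] + cy;
        c[k] = v % R;
        cy = v / R;
    }
    while ( cy != 0 )  { c.push_back(cy % R);  cy /= R; }  // extra leading digits
    return c;
}
```

For 82 x 34 the convolution of (2, 8) and (4, 3) gives (8, 38, 24), and the carry pass with R = 10 turns it into (8, 8, 7, 2), that is, 2788.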


28.2.2 Polynomial multiplication is linear convolution 


The $c_k$ in relation 28.2-3 can be found by comparing coefficients: they must satisfy the equation

\[ c_k = \sum_{i+j=k} a_i\, b_j \tag{28.2-4} \]

This is equation 22.1-8, a linear convolution: multiplication of two numbers is a linear
convolution (polynomial multiplication) of the digit sequences, followed by carries.

In section 22.1.2 we have seen that the convolution of two sequences A and B can be
computed as follows:

1. Transform: $\hat{A} := \mathrm{FFT}(A)$ and $\hat{B} := \mathrm{FFT}(B)$.

2. Multiply the transformed sequences element-wise: $\hat{C} := \hat{A} \cdot \hat{B}$.

3. Transform back: $C := \mathrm{FFT}^{-1}(\hat{C})$.

The scheme is equivalent to the following:

1. Evaluate: evaluate both polynomials A and B at sufficiently many points. Let the sequences of
   evaluations be $\hat{A}$ and $\hat{B}$.

2. Multiply the evaluations element-wise: $\hat{C} := \hat{A} \cdot \hat{B}$. The sequence $\hat{C}$ contains the values of the
   polynomial C.

3. Interpolate: find the polynomial C corresponding to the sequence of values $\hat{C}$.

If we use the roots of unity as the points of evaluation, then the FFT can be used to evaluate the
polynomials and the inverse FFT for the interpolation, both with complexity O(N log N). The FFT is
a fast algorithm to evaluate a polynomial of degree n at the n-th roots of unity in parallel. You might
be surprised if you thought of the FFT as an algorithm for the decomposition into frequencies. There is
no problem with either of these notions.
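The evaluate, multiply, interpolate scheme can be sketched with a Fourier transform computed by definition (quadratic time, so this shows the idea and not the speed). The function names dft() and multiply() are assumptions for the example, and the rounding step relies on the coefficients being small enough for double precision.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

// Fourier transform by definition: F[k] = sum_j f[j] * exp(sign*2*pi*i*j*k/n).
// With sign=+1 this evaluates the polynomial with coefficients f[] at the n-th roots of unity.
std::vector<cplx> dft(const std::vector<cplx> &f, int sign)
{
    const std::size_t n = f.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> F(n);
    for (std::size_t k=0; k<n; ++k)
    {
        cplx s = 0.0;
        for (std::size_t j=0; j<n; ++j)
            s += f[j] * std::polar(1.0, double(sign)*2.0*pi*double(j*k)/double(n));
        F[k] = s;
    }
    return F;
}

// Coefficients of the product of the polynomials a and b (lowest coefficient first),
// using n evaluation points; n must be at least the number of product coefficients.
std::vector<long> multiply(const std::vector<long> &a, const std::vector<long> &b, std::size_t n)
{
    assert( !a.empty() && !b.empty() && a.size()+b.size()-1 <= n );
    std::vector<cplx> fa(n, 0.0), fb(n, 0.0);  // zero padded copies
    for (std::size_t i=0; i<a.size(); ++i)  fa[i] = double(a[i]);
    for (std::size_t i=0; i<b.size(); ++i)  fb[i] = double(b[i]);
    std::vector<cplx> A = dft(fa, +1), B = dft(fb, +1);  // evaluate
    for (std::size_t k=0; k<n; ++k)  A[k] *= B[k];       // multiply element-wise
    const std::vector<cplx> C = dft(A, -1);              // interpolate (up to the factor n)
    std::vector<long> c(n);
    for (std::size_t k=0; k<n; ++k)  c[k] = std::lround(C[k].real() / double(n));
    return c;
}
```

Calling multiply() with the coefficient lists (2, 8) and (4, 3) and n = 4 evaluates at the points +1, +i, -1, -i and should return (8, 38, 24, 0), the coefficients of 24x^2 + 38x + 8.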


Re-launching our example (82 · 34 = 2788), we use the fourth roots of unity ±1 and ±i = ±√−1:

      x    |  A = (8x + 2)  |  B = (3x + 4)  |  C = A B
     ------+----------------+----------------+--------------
      +1   |  +10           |  +7            |  +70
      +i   |  +8i + 2       |  +3i + 4       |  +38i − 16
      −1   |  −6            |  +1            |  −6
      −i   |  −8i + 2       |  −3i + 4       |  −38i − 16
                                              C = (24x^2 + 38x + 8)

Read this table as follows: First the given polynomials A and B are evaluated at the points given in the
left column, thereby the columns below A and B are filled. Then the values are multiplied to fill the
column below C, giving the sequence $\hat{C}$ of the values of C at the points. Finally, the actual polynomial C
is found from those values, resulting in the lower right entry. We use a GP script that does the Fourier
transform by definition:

    C=[+70, +38*I-16, -6, -38*I-16]
    \\ Fourier transform by definition:
    { forstep (k=3, 0, -1,  \\ highest to lowest
        ck=sum(j=0,3, C[j+1]*exp(-2*Pi*I*k*j/4));  \\ inverse transform: negative sign
        print( ck/4 );  \\ with normalization
    ); }
The output is 
-46687853312469 E-37x*1 


The product of two polynomials of degree n and m has degree m + n, so it has m + n + 1 coefficients and
we need at least m + n + 1 points of evaluation. An N-digit number corresponds to a polynomial of degree
N − 1. Therefore, when multiplying two N-digit numbers, we need at least 2N − 1 points of evaluation
(this is why we zero-pad the sequences for linear convolution, see relation 22.1-7a on page 444). In the
example above we could have used only three points of evaluation, but the evaluations at the third roots
of unity would have given noninteger evaluations, making the table harder to read.


The operation count is dominated by that of the FFTs: the element-wise multiplication is of course 
O(N), so the whole fast convolution algorithm is O(N log(N)). The following carry operation is also 
O(N) and can therefore also be neglected. 


Assume the hidden constant equals 5. Multiplying our million-digit numbers will need about

\[ 5 \cdot 10^6 \log_2(10^6) \approx 5 \cdot 10^6 \cdot 20 = 10^8 = 0.1 \cdot 10^9 \]

operations, taking approximately a tenth of a second on our computer.
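The operation counts quoted in this chapter for million-digit operands can be reproduced with a few lines. The hidden constants (1, 2, 2, and 5) are the ones assumed in the text; the function names are made up for this sketch.

```cpp
#include <cassert>
#include <cmath>

// Estimated operation counts for multiplying two N-digit numbers,
// using the hidden constants assumed in the text.
double ops_schoolbook(double N)  { assert( N>=1 );  return N * N; }
double ops_karatsuba(double N)   { assert( N>=1 );  return 2.0 * std::pow(N, std::log2(3.0)); }
double ops_toom3(double N)       { assert( N>=1 );  return 2.0 * std::pow(N, std::log(5.0)/std::log(3.0)); }
double ops_fft(double N)         { assert( N>=1 );  return 5.0 * N * std::log2(N); }
```

For N = 10^6 these give about 10^12, 6.47 · 10^9, 1.23 · 10^9, and 10^8 operations, matching the estimates of 1000, 6.5, 1.2, and 0.1 seconds at one billion operations per second.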


We note that the complexity O(N log N) is not exactly the truth, it has to be O(N log(N) f(N)) for
some very slowly growing function f. For example, f(N) = log log N with the Schönhage-Strassen
multiplication algorithm given in [302], see also [358, entry "Schönhage-Strassen algorithm"]. Several
multiplication algorithms are given in ch.4.3.3]. See [224] on how far the idea "polynomials for
numbers" can be carried and where it fails.


28.3 Radix/precision considerations with FFT multiplication 


Now we look at the dependencies between the radix and the achievable precision with FFT multiplication. 


We use unsigned 16-bit words for the digits. So the radix of the numbers can be in the range
2, 3, ..., 65536 (= 2^16). If working in base 10, we will actually use 'super-digits' of base 10,000, the
largest power of 10 that fits into a 16-bit word. These super-digits are called LIMBs in hfloat.


With very large precision we cannot always use the greatest power of the desired base, since the
components of the convolution must be representable as integer numbers with the data type used for the FFTs:
the cumulative sums $c_k$ have to be represented precisely enough to distinguish every (integer) quantity
from the next greater or smaller value. The highest possible value for a $c_k$ will appear in the middle of
the product and if the multiplicand and the multiplier consist of 'nines' (that is R − 1) only. For radix
R and a precision of N LIMBs the maximal possible value L is

\[ L = N\,(R-1)^2 \tag{28.3-1} \]

Note that with FFT-based convolution the absolute value of the central term can in fact equal
$|L| = N^2 (R-1)^2$. But there is no need to distinguish that many integers. After dividing by N we are back at
relation 28.3-1.
    Radix R          | max # LIMBs  | max # hex digits | max # bits
    -----------------+--------------+------------------+-----------
    2^10 = 1024      | 1048,576 k   | 2621,440 k       | 10240 M
    2^11 = 2048      | 262,144 k    | 720,896 k        | 2816 M
    2^12 = 4096      | 65,536 k     | 196,608 k        | 768 M
    2^13 = 8192      | 16384 k      | 53,248 k         | 208 M
    2^14 = 16384     | 4096 k       | 14,336 k         | 56 M
    2^15 = 32768     | 1024 k       | 3840 k           | 15 M
    2^16 = 65536     | 256 k        | 1024 k           | 4 M
    -----------------+--------------+------------------+-----------
    2^17 = 128 k     | 64 k         | 272 k            | 1062 k
    2^18 = 256 k     | 16 k         | 72 k             | 281 k
    2^19 = 512 k     | 4 k          | 19 k             | 74 k
    2^20 = 1 M       | 1 k          | 5 k              | 19 k
    2^21 = 2 M       | 256          | 1300             | 5120

    Radix R          | max # LIMBs  | max # dec digits | max # bits
    -----------------+--------------+------------------+-----------
    10^2             | 110 G        | 220 G            | 730 G
    10^3             | 1100 M       | 3300 M           | 11 G
    10^4             | 11 M         | 44 M             | 146 M
    10^5             | 110 k        | 550 k            | 1826 k
    10^6             | 1 k          | 6,597            | 22 k
    10^7             | 11           | 77               | 255

Figure 28.3-A: The maximal number of digits such that FFT multiplication with a mantissa of 53 bits
can be used, for hexadecimal (top) and decimal (bottom) numbers.


The number of bits to represent L exactly is the integer greater than or equal to

\[ \log_2(N\,(R-1)^2) = \log_2 N + 2 \log_2(R-1) \tag{28.3-2} \]

Due to roundoff errors there must be a few more bits for safety. If computations are made using
double-precision floating-point numbers (C-type double) one typically has a mantissa (significand) of 53 bits.
Then we need to have

\[ M \geq \log_2 N + 2 \log_2(R-1) + S \tag{28.3-3} \]

where M := mantissa-bits and S := safety-bits. Using $\log_2(R-1) < \log_2(R)$ we obtain

\[ N_{max}(R) = 2^{M - S - 2 \log_2(R)} \tag{28.3-4} \]

Suppose we have M = 53 mantissa-bits and require S = 3 safety-bits. With base 2 numbers one could
use radix R = 2^16 for precisions up to a length of $N_{max} = 2^{53-3-2 \cdot 16} = 2^{18}$ = 256 k LIMBs. Corresponding are
4096 kilo bits and 1024 kilo hex digits. For greater lengths smaller radices have to be used according
to figure 28.3-A (top, extra horizontal line at the 16 bit limit for LIMBs), the equivalent table for decimal
numbers is shown at the bottom of the figure. Summary:


• For decimal digits and precisions up to 11 million LIMBs (44 million decimal digits) use radix 10,000.
  For even greater precisions choose radix 1,000.
• For hexadecimal digits and precisions up to 256,000 LIMBs (1 million hex digits) use radix 65,536.
  For even greater precisions choose radix 4,096.
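Relation 28.3-4 is easy to evaluate for a given radix. The following helper is only a sketch: the name max_limbs() and its default arguments (M = 53, S = 3, as in the text) are assumptions, and the result is rounded down to a whole number of LIMBs.

```cpp
#include <cassert>
#include <cmath>

// Largest length (in LIMBs) allowed by relation 28.3-4:
// N_max(R) = 2^(M - S - 2*log2(R)) = 2^(M-S) / R^2, rounded down.
// M: mantissa bits of the floating-point type, S: safety bits.
double max_limbs(double R, int M=53, int S=3)
{
    assert( R >= 2.0 && M > S );
    return std::floor( std::ldexp(1.0, M-S) / (R*R) );
}
```

For R = 65536 this gives 262144 (256 k) LIMBs, and for R = 10000 about 11.26 million, in line with the entries of figure 28.3-A.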


If convolution routines based on number theoretic transforms (NTT) are used (see the section on page
542), then no loss of precision can occur. However, the creation of a reasonably fast routine is quite
nontrivial, see [155] for the implementation of the Schönhage-Strassen algorithm.
28.4 The sum-of-digits test


With high-precision calculations it is mandatory to add a sanity check to the multiplication routines. This
way false results due to loss of accuracy should (with high probability) be detected via the sum-of-digits
test (the radix used is R):


1. Compute the values ('sums of digits') $s_a = a \bmod (R-1)$ and $s_b = b \bmod (R-1)$.
2. Compute the product $c = a \cdot b$.
3. Compute $s_c = c \bmod (R-1)$ and $s_m = s_a \cdot s_b \bmod (R-1)$.
4. If $s_c \neq s_m$, then an error has occurred in the computation of c.

The sum-of-digits function $s_a$ can, for a radix-R, length-n number a, be computed as [FXT: mult/auxil.cc]:

    ulong
    sum_of_digits(const LIMB *a, ulong n, ulong nine, ulong s)
    {
        for (ulong k=0; k<n; ++k)  s += a[k];
        s %= nine;
        return s;
    }

where the variable nine has to be set to R − 1, and s to zero.


The computation of $s_m = s_a \cdot s_b$ is done in

    ulong
    mult_sum_of_digits(const LIMB *a, ulong an,
                       const LIMB *b, ulong bn,
                       ulong nine)
    {
        ulong qsa = sum_of_digits(a, an, nine, 0);
        ulong qsb = sum_of_digits(b, bn, nine, 0);
        ulong qsm = (qsa*qsb) % nine;
        return qsm;
    }

The checks in the multiplication routine [FXT: mult/fxtmultiply.cc] can be outlined as:

    fxt_multiply(const LIMB *a, ulong an,
                 const LIMB *b, ulong bn,
                 LIMB *c, ulong cn,
                 uint rx)
    {
        const ulong nine = rx-1;
        ulong qsm=0, qsp=0;
        qsm = mult_sum_of_digits(a, an, b, bn, nine);
        // Multiply: c=a*b
        // If carrying through c gives an additional (leading) digit,
        // then set cy to that value, else set cy=0.
        qsp = sum_of_digits(c, cn, nine, cy);
        if ( qsm!=qsp )  { /* FAILED */ }
    }


If we assume that a failed multiplication produces ‘random’ digits in c, then the probability that a failed 
multiplication goes unnoticed equals 1/R. 


Omitting the sum-of-digits test is not an option: the situation that some number contains mainly ‘nines’ 
in the course of a high-precision calculation is very common. Therefore insufficient precision in the FFTs 
will almost certainly result in an error. 
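The test described above can be tried out on small numbers. The following is a minimal sketch, not the FXT code: it assumes limbs stored least significant first, a radix of at most 65,536, and it uses a schoolbook product as a stand-in for the FFT multiplication. The names `multiply` and `passes_sum_of_digits_test` are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t LIMB;   // assumption: one radix-R digit per limb, R <= 65536
typedef std::uint64_t u64;

// Residue of a radix-R number modulo nine = R-1
// (limbs are stored least significant first).
u64 sum_of_digits(const std::vector<LIMB> &a, u64 nine)
{
    u64 s = 0;
    for (LIMB digit : a)  s += digit;
    return s % nine;
}

// Schoolbook product c = a*b in radix R,
// a simple stand-in for an FFT based multiplication.
std::vector<LIMB> multiply(const std::vector<LIMB> &a,
                           const std::vector<LIMB> &b, u64 R)
{
    std::vector<LIMB> c(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        u64 cy = 0;
        for (std::size_t j = 0; j < b.size(); ++j)
        {
            u64 v = c[i+j] + (u64)a[i] * b[j] + cy;
            c[i+j] = (LIMB)(v % R);
            cy = v / R;
        }
        c[i + b.size()] = (LIMB)cy;  // position not written before, so no addition needed
    }
    return c;
}

// True if c passes the sum-of-digits test for the product a*b.
bool passes_sum_of_digits_test(const std::vector<LIMB> &a,
                               const std::vector<LIMB> &b,
                               const std::vector<LIMB> &c, u64 R)
{
    const u64 nine = R - 1;
    const u64 qsm = (sum_of_digits(a, nine) * sum_of_digits(b, nine)) % nine;
    const u64 qsp = sum_of_digits(c, nine);
    return qsm == qsp;
}
```

With radix 10,000 the product of 12349999 and 425678 passes the test, and changing a single digit of the result by one makes it fail.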


The simplicity of the sum-of-digits test that uses the modulus $R-1$ can be seen from the polynomial
identity

\[ \sum_{k} a_k R^k \equiv \sum_{k} a_k \pmod{R-1} \tag{28.4-1} \]

One can use other moduli, like

\[ \sum_{k} a_k R^k \equiv \sum_{k} (-1)^k a_k \pmod{R+1} \tag{28.4-2} \]

Moduli $R^n-1$ for small n are especially convenient:

\begin{align}
\sum_{k} a_k R^k &\equiv \sum_{k \text{ even}} a_k + R \sum_{k \text{ odd}} a_k \pmod{R^2-1} \tag{28.4-3a} \\
&\equiv \sum_{k \equiv 0 \bmod 3} a_k + R \sum_{k \equiv 1 \bmod 3} a_k + R^2 \sum_{k \equiv 2 \bmod 3} a_k \pmod{R^3-1} \tag{28.4-3b} \\
&\equiv \sum_{u=0}^{n-1} R^u \sum_{k \equiv u \bmod n} a_k \pmod{R^n-1} \tag{28.4-3c}
\end{align}

The probability of an unrecognized error is reduced to approximately $1/R^n$. The multiplication of the
residues involves $O(n^2)$ operations.


28.5 Binary exponentiation 


The binary exponentiation (or binary powering) scheme is a method to compute the e-th power of a
number a, using about $\log_2(e)$ multiplications and squarings. The term ‘number’ can be replaced by
about anything one can multiply. That includes integers, floating-point numbers, polynomials, matrices,
integer remainders modulo some modulus, polynomials modulo a polynomial and so on. In fact, the
given algorithms work for any group: we do not need commutativity but $a^n \cdot a^m = a^{n+m}$ must hold
(power-associativity).


28.5.1  Right-to-left powering 
This algorithm uses the binary expansion of the exponent: let $e > 0$, write e in base 2 as
$e = [e_j, e_{j-1}, \ldots, e_1, e_0]$, $e_i \in \{0, 1\}$. Then

\begin{align}
a^e &= a^{1 \cdot e_0}\, a^{2 \cdot e_1}\, a^{4 \cdot e_2}\, a^{8 \cdot e_3} \cdots a^{2^j \cdot e_j} \tag{28.5-1a} \\
&= (a^1)^{e_0}\, (a^2)^{e_1}\, (a^4)^{e_2} \cdots (a^{2^j})^{e_j} \tag{28.5-1b}
\end{align}

We initialize a variable t by 1, generate the powers $s_i = a^{2^i}$ by successive squarings
$s_i = s_{i-1}^2 = (a^{2^{i-1}})^2$, and multiply t by $s_i$ if $e_i$ equals 1. The following C++ code
computes the e-th power of the (double precision) number a:

double power_r2l(double a, ulong e)
{
    double t = 1;
    if ( e )
    {
        double s = a;
        while ( 1 )
        {
            if ( e & 1 )  t *= s;
            e /= 2;
            if ( 0==e )  break;
            s *= s;
        }
    }
    return t;
}
An easy optimization is to avoid the multiplication by 1 if the exponent is a power of 2: 


double power_r2l(double a, ulong e)
{
    if ( 0==e )  return 1;

    double s = a;
    while ( 0==(e&1) )
    {
        s *= s;
        e /= 2;
    }

    a = s;
    while ( 0!=(e/=2) )
    {
        s *= s;
        if ( e & 1 )  a *= s;
    }
    return a;
}


The program [FXT: arith/power-r2l-demo.cc] shows the quantities that occur with the computation of
$p = 2^{38}$:

arg 1: 2 == a  [number to exponentiate]  default=2
arg 2: 38 == e  [exponent]  default=38
e=1..11.
 0  2  2
 1  4  4
 1  16  64
 0  256  64
 0  65536  64
 1  4294967296  274877906944
p=a**e = 274877906944


In the right-to-left powering scheme the exponent is scanned starting from the lowest bit. 
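Because the scheme only needs an associative multiplication, it can be written once as a template. The following is a sketch under stated assumptions: the function template, the helper type `Mat2` and the function `fibonacci` are illustrative names and not part of FXT. It computes Fibonacci numbers as powers of a 2x2 integer matrix, an example of a multiplication that is not commutative in general.

```cpp
typedef unsigned long ulong;

// Right-to-left binary powering for any type T with an associative operator *.
// 'one' must be the neutral element of the multiplication.
template <typename T>
T power_r2l(T a, ulong e, T one)
{
    T t = one;
    T s = a;
    while ( e != 0 )
    {
        if ( e & 1 )  t = t * s;   // bit is set: multiply by the current power
        e >>= 1;
        if ( e != 0 )  s = s * s;  // next power a^(2^i), skipped after the top bit
    }
    return t;
}

// 2x2 integer matrices as an example of a non-commutative multiplication.
struct Mat2
{
    unsigned long long m[2][2];
};

Mat2 operator * (const Mat2 &x, const Mat2 &y)
{
    Mat2 r;
    for (int i=0; i<2; ++i)
        for (int j=0; j<2; ++j)
            r.m[i][j] = x.m[i][0]*y.m[0][j] + x.m[i][1]*y.m[1][j];
    return r;
}

// n-th Fibonacci number from the n-th power of [[1,1],[1,0]].
unsigned long long fibonacci(ulong n)
{
    const Mat2 one = { { {1,0}, {0,1} } };
    const Mat2 f   = { { {1,1}, {1,0} } };
    return power_r2l(f, n, one).m[0][1];
}
```

The same template reproduces the demo above when called as `power_r2l(2.0, 38UL, 1.0)`.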


28.5.2  Left-to-right powering 


The left-to-right binary powering algorithm scans the exponent starting from the highest bits. We use
the facts that $a^{2k} = (a^k)^2$ and $a^{2k+1} = (a^k)^2\, a$. Implementation is simple:

double power_l2r(double a, ulong e)
{
    if ( 0==e )  return 1;
    double s = a;
    ulong b = highest_one(e);
    while ( b>1 )
    {
        b >>= 1;
        s *= s;
        if ( e & b )  s *= a;
    }
    return s;
}


The program [FXT: arith/power-l2r-demo.cc] shows the quantities that occur with the computation of
$p = 2^{38}$ when the left-to-right scan is used:

arg 1: 2 == a  [number to exponentiate]  default=2
arg 2: 38 == e  [exponent]  default=38
e=1..11.
 1  2
 0  16  16
 1  256  512
 1  262144  524288
 0  274877906944  274877906944
p=a**e = 274877906944


All multiplications apart from the squarings happen with the unchanged value of a. This is an advantage
if a is a small (integer) value so that the multiplications are cheap. As a slightly extreme example, if
one computes $7^{7^7} = 0.3759823526783 \cdot 10^{695975}$ to full precision, then the left-to-right powering is about
three times faster. If a is a full-precision number (and multiplication is done via FFTs), then the FFT of
a only needs to be computed once. Thus all multiplications except for the first count as squarings. This
technique is called FFT caching.


The given powering algorithms are good enough for most applications. There are schemes that improve 
further. For repeated power computations, especially for very large exponents, the schemes based on
addition chains lead to better algorithms, see [213] and [44]. The ‘flexible window powering method’ is
described and analyzed in [112]. A readable survey of exponentiation methods is given in [160].


Techniques for accelerating computations of factorials and binomial coefficients are described in [207]. 


28.5.3 Cost of binary exponentiation of full-precision numbers

With full-precision numbers the cost of binary powering is the same for both the left-to-right and the
right-to-left algorithm. As an example, to raise x to the 26-th power, note that $e = 26 = 11010_2$ and we
can write

\[ x^{26} = x^{16} \cdot x^{8} \cdot x^{2} = \left(\left(\left(x^2 \cdot x\right)^2\right)^2 \cdot x\right)^2 \tag{28.5-2} \]

Here we need four squarings and two multiplications. In general one needs $\lfloor \log_2 e \rfloor$ squarings and $h(e) - 1$
multiplications where $h(e)$ is the number of set bits in the binary expansion of e. Figure 28.5-A lists the
cost of the exponentiation for small exponents e in terms of squarings and multiplications and, assuming
a squaring costs two FFTs and a multiplication three, in terms of FFTs. The table was created with the
program [FXT: arith/power-costs-demo.cc].

  e: | e (radix 2) | #S | #M | #F | #C  ||   e: | e (radix 2) | #S | #M | #F | #C
  1  |   ......1   |  0 |  0 |  0 |     ||  41  |   .1.1..1   |  5 |  2 | 16 | 15
  2  |   .....1.   |  1 |  0 |  2 |     ||  42  |   .1.1.1.   |  5 |  2 | 16 | 15
  3  |   .....11   |  1 |  1 |  5 |     ||  43  |   .1.1.11   |  5 |  3 | 19 | 17
  4  |   ....1..   |  2 |  0 |  4 |     ||  44  |   .1.11..   |  5 |  2 | 16 | 15
  5  |   ....1.1   |  2 |  1 |  7 |     ||  45  |   .1.11.1   |  5 |  3 | 19 | 17
  6  |   ....11.   |  2 |  1 |  7 |     ||  46  |   .1.111.   |  5 |  3 | 19 | 17
  7  |   ....111   |  2 |  2 | 10 |  9  ||  47  |   .1.1111   |  5 |  4 | 22 | 19
  8  |   ...1...   |  3 |  0 |  6 |     ||  48  |   .11....   |  5 |  1 | 13 |
  9  |   ...1..1   |  3 |  1 |  9 |     ||  49  |   .11...1   |  5 |  2 | 16 | 15
 10  |   ...1.1.   |  3 |  1 |  9 |     ||  50  |   .11..1.   |  5 |  2 | 16 | 15
 11  |   ...1.11   |  3 |  2 | 12 | 11  ||  51  |   .11..11   |  5 |  3 | 19 | 17
 12  |   ...11..   |  3 |  1 |  9 |     ||  52  |   .11.1..   |  5 |  2 | 16 | 15
 13  |   ...11.1   |  3 |  2 | 12 | 11  ||  53  |   .11.1.1   |  5 |  3 | 19 | 17
 14  |   ...111.   |  3 |  2 | 12 | 11  ||  54  |   .11.11.   |  5 |  3 | 19 | 17
 15  |   ...1111   |  3 |  3 | 15 | 13  ||  55  |   .11.111   |  5 |  4 | 22 | 19
 16  |   ..1....   |  4 |  0 |  8 |     ||  56  |   .111...   |  5 |  2 | 16 | 15
 17  |   ..1...1   |  4 |  1 | 11 |     ||  57  |   .111..1   |  5 |  3 | 19 | 17
 18  |   ..1..1.   |  4 |  1 | 11 |     ||  58  |   .111.1.   |  5 |  3 | 19 | 17
 19  |   ..1..11   |  4 |  2 | 14 | 13  ||  59  |   .111.11   |  5 |  4 | 22 | 19
 20  |   ..1.1..   |  4 |  1 | 11 |     ||  60  |   .1111..   |  5 |  3 | 19 | 17
 21  |   ..1.1.1   |  4 |  2 | 14 | 13  ||  61  |   .1111.1   |  5 |  4 | 22 | 19
 22  |   ..1.11.   |  4 |  2 | 14 | 13  ||  62  |   .11111.   |  5 |  4 | 22 | 19
 23  |   ..1.111   |  4 |  3 | 17 | 15  ||  63  |   .111111   |  5 |  5 | 25 | 21
 24  |   ..11...   |  4 |  1 | 11 |     ||  64  |   1......   |  6 |  0 | 12 |
 25  |   ..11..1   |  4 |  2 | 14 | 13  ||  65  |   1.....1   |  6 |  1 | 15 |
 26  |   ..11.1.   |  4 |  2 | 14 | 13  ||  66  |   1....1.   |  6 |  1 | 15 |
 27  |   ..11.11   |  4 |  3 | 17 | 15  ||  67  |   1....11   |  6 |  2 | 18 | 17
 28  |   ..111..   |  4 |  2 | 14 | 13  ||  68  |   1...1..   |  6 |  1 | 15 |
 29  |   ..111.1   |  4 |  3 | 17 | 15  ||  69  |   1...1.1   |  6 |  2 | 18 | 17
 30  |   ..1111.   |  4 |  3 | 17 | 15  ||  70  |   1...11.   |  6 |  2 | 18 | 17
 31  |   ..11111   |  4 |  4 | 20 | 17  ||  71  |   1...111   |  6 |  3 | 21 | 19
 32  |   .1.....   |  5 |  0 | 10 |     ||  72  |   1..1...   |  6 |  1 | 15 |
 33  |   .1....1   |  5 |  1 | 13 |     ||  73  |   1..1..1   |  6 |  2 | 18 | 17
 34  |   .1...1.   |  5 |  1 | 13 |     ||  74  |   1..1.1.   |  6 |  2 | 18 | 17
 35  |   .1...11   |  5 |  2 | 16 | 15  ||  75  |   1..1.11   |  6 |  3 | 21 | 19
 36  |   .1..1..   |  5 |  1 | 13 |     ||  76  |   1..11..   |  6 |  2 | 18 | 17
 37  |   .1..1.1   |  5 |  2 | 16 | 15  ||  77  |   1..11.1   |  6 |  3 | 21 | 19
 38  |   .1..11.   |  5 |  2 | 16 | 15  ||  78  |   1..111.   |  6 |  3 | 21 | 19
 39  |   .1..111   |  5 |  3 | 19 | 17  ||  79  |   1..1111   |  6 |  4 | 24 | 21
 40  |   .1.1...   |  5 |  1 | 13 |     ||  80  |   1.1....   |  6 |  1 | 15 |

Figure 28.5-A: Cost of binary powering of full-precision numbers for small exponents e in terms of
squarings (#S), multiplications (#M) and FFTs (#F). If the left-to-right exponentiation algorithm with
FFT caching needs fewer FFTs, then the number is given under (#C).
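The entries of the table follow directly from the counting rules stated in the text. The following sketch is not the FXT demo program; the struct `PowerCost` and the function `power_cost` are names made up here. It assumes two FFTs per squaring, three per multiplication, and that with FFT caching every multiplication after the first costs only two FFTs.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Cost of computing a^e by binary powering.
struct PowerCost
{
    ulong S;  // number of squarings
    ulong M;  // number of multiplications
    ulong F;  // FFTs: 2 per squaring, 3 per multiplication
    ulong C;  // FFTs with left-to-right powering and FFT caching
};

PowerCost power_cost(ulong e)
{
    assert( e >= 1 );
    ulong len = 0, bits = 0;
    for (ulong t = e; t != 0; t >>= 1)  { bits += (t & 1);  ++len; }
    const ulong S = len - 1;    // floor(log_2(e))
    const ulong M = bits - 1;   // h(e) - 1
    const ulong F = 2*S + 3*M;
    // With caching, the first multiplication costs 3 FFTs, all later ones 2:
    const ulong C = ( M==0 ? F : 2*S + 3 + 2*(M-1) );
    return { S, M, F, C };
}
```

For e = 26 this gives 4 squarings, 2 multiplications, 14 FFTs, and 13 FFTs with caching, in agreement with the row for 26 in the figure.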
Chapter 29


Root extraction 


We describe methods to compute the inverse, square root, and higher roots of a given number. The 
computation of any of these costs just the equivalent of a few full-precision multiplications. 


29.1 Division, square root and cube root 


29.1.1 Inverse and division 


The ordinary division algorithm is far too expensive for numbers of extreme precision. Instead one
replaces the division $a/d$ by the multiplication of a with the inverse of d. The inverse of d is computed by
finding a starting approximation $x_0 \approx 1/d$, and then iterating

\[ x_{k+1} = x_k + x_k\,(1 - d\,x_k) \tag{29.1-1} \]

until the desired precision is reached. The convergence is quadratic (second order), which means that the
number of correct digits is doubled with each step: if $x_k = \frac{1}{d}(1+e)$, then $x_{k+1} = \frac{1}{d}(1-e^2)$.

Moreover, each step only requires computations with twice the number of digits that were correct at its
beginning. Still better: the multiplication $x_k\,(\ldots)$ needs only to be done with half of the current precision
as it computes the correcting digits (which alter only the less significant half of the digits). Thus, at each
step we have 1.5 multiplications of the current precision: a full precision multiplication for $d\,x_k$ and a half
precision multiplication for $x_k\,(\ldots)$. The total work amounts to $1.5 + 1.5/2 + 1.5/4 + \ldots = 1.5 \cdot (1 + 1/2 + 1/4 + \ldots)$,
which is less than three full precision multiplications. The cost of a multiplication is set to $\sim N$ for the
estimates made here; this gives a realistic picture for large N. Together with the final multiplication a
division costs as much as four multiplications.


The numerical example given in figure 29.1-A shows the first steps of the computation of an inverse
starting from a two-digit initial approximation.

The achieved precision can be determined by the absolute value of $(1 - d\,x_k)$. In hfloat, if the achieved
precision is below a certain limit, a third order correction is used to assure maximum precision at the
last step:

\[ x_{k+1} = x_k + x_k\,(1 - d\,x_k) + x_k\,(1 - d\,x_k)^2 \tag{29.1-2} \]

One should in general not use algebraically equivalent forms like $x_{k+1} = 2\,x_k - d\,x_k^2$ (for the second order
iteration) because computationally there is a difference: cancellation can occur and the information on
the achieved precision is not found easily.

If the divisor has the same precision as the dividend, the division is called a long division. If the
divisor fits into a machine word the operation can be done in linear time (short division). Similarly, a
multiplication where both operands have full precision is called a long multiplication, and, if one operand
fits into a machine word, a short multiplication.

d       := 3.1415926
x0      := 0.31   [initial 2-digit approximation for 1/d]
d·x0    := 3.141 · 0.3100 = 0.9737
y0      := 1.000 − d·x0 = 0.02629
x0·y0   := 0.3100 · 0.02629 = 0.0081(49)
x1      := x0 + x0·y0 = 0.3100 + 0.0081 = 0.3181
d·x1    := 3.1415926 · 0.31810000 = 0.9993406
y1      := 1.0000000 − d·x1 = 0.0006594
x1·y1   := 0.31810000 · 0.0006594 = 0.0002097(5500)
x2      := x1 + x1·y1 = 0.31810000 + 0.0002097 = 0.31830975
d·x2    := 3.1415926 · 0.31830975 = 0.99999955
y2      := 1.0000000 − d·x2 = 0.00000014
x2·y2   := 0.31830975 · 0.00000014 = 0.000000044
x3      := x2 + x2·y2 = 0.31830975 + 0.000000044 = 0.31830979399

Figure 29.1-A: First steps of the computation of the inverse of π.


29.1.2 Inverse square root 
Computation of inverse square roots can be done using a similar scheme: find a starting approximation
$x_0 \approx 1/\sqrt{d}$, then iterate

\[ x_{k+1} = x_k + x_k\,\frac{(1 - d\,x_k^2)}{2} \tag{29.1-3} \]

Convergence is again second order: if $x_k = \frac{1}{\sqrt{d}}(1+e)$, then

\[ x_{k+1} = \frac{1}{\sqrt{d}} \left(1 - \frac{3}{2}\,e^2 - \frac{1}{2}\,e^3\right) \tag{29.1-4} \]

If the achieved precision is below a certain limit, a third order correction should be applied:

\[ x_{k+1} = x_k + x_k\,\frac{(1 - d\,x_k^2)}{2} + x_k\,\frac{3\,(1 - d\,x_k^2)^2}{8} \tag{29.1-5} \]

To compute the square root, first compute $1/\sqrt{d}$, then a final multiplication with d gives $\sqrt{d}$.

With squaring considered as expensive as multiplication (although an FFT squaring costs only about 2/3 of
an FFT multiplication), we reach an operation count of four multiplications for computing $1/\sqrt{d}$ and five for $\sqrt{d}$.
This algorithm is considerably better than iterating $x_{k+1} := \frac{1}{2}\left(x_k + \frac{d}{x_k}\right)$ because no long divisions are
involved.


A unified routine that implements the computation of the inverse a-th roots is given in [hfloat:
src/hf/itiroot.cc]. The general form of the divisionless iteration for the inverse a-th root of d is, up to third
order:

\[ x_{k+1} = x_k \left(1 + \frac{(1 - d\,x_k^a)}{a} + \frac{(1+a)\,(1 - d\,x_k^a)^2}{2\,a^2}\right) \tag{29.1-6} \]

The initial approximation is computed using ordinary floating-point numbers (type double) with special
precautions to avoid overflow with exponents that cannot be represented with doubles. Third order
corrections are made whenever the achieved precision falls below a certain limit.


29.1.3 Cube root extraction 


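We use the relation $d^{1/3} = d\,(d^2)^{-1/3}$. That is, we compute the inverse third root of $d^2$ using the iteration

\[ x_{k+1} = x_k + x_k\,\frac{(1 - d^2\,x_k^3)}{3} \tag{29.1-7} \]

and finally multiply with d. Convergence is second order: if $x_k = \frac{1}{\sqrt[3]{d^2}}(1+e)$, then

\[ x_{k+1} = \frac{1}{\sqrt[3]{d^2}} \left(1 - 2\,e^2 - \frac{4}{3}\,e^3 - \frac{1}{3}\,e^4\right) \tag{29.1-8} \]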


29.1.4 Improved iteration for the square root 


Actually, the ‘simple’ version of the square root iteration ($x_{k+1} := \frac{1}{2}\left(x_k + \frac{d}{x_k}\right)$) can be used for practical
purposes if rewritten as a coupled iteration for both $\sqrt{d}$ and its inverse. For $\sqrt{d}$ we use the iteration

\begin{align}
x_{k+1} &= x_k - \frac{(x_k^2 - d)}{2\,x_k} \tag{29.1-9} \\
&= x_k - v_{k+1}\,\frac{(x_k^2 - d)}{2} \quad \text{where } v \approx 1/x \tag{29.1-10}
\end{align}

For the auxiliary $v \approx 1/\sqrt{d}$ we iterate

\[ v_{k+1} = v_k + v_k\,(1 - x_k\,v_k) \tag{29.1-11} \]

We start with approximations

\begin{align}
x_0 &\approx \sqrt{d} \tag{29.1-12} \\
v_0 &\approx 1/x_0 \tag{29.1-13}
\end{align}

The v-iteration must precede that for x in each step. If carefully implemented, this method turns out to
be significantly more efficient than the computation via the inverse root. An implementation is given in
[hfloat: src/hf/itsqrt.cc]. The idea is due to Schönhage.


29.1.5 A different view on the iterations 


Let p be a prime and assume you know the inverse $x_0$ of a given number d modulo p. With (the iteration
for the inverse, relation 29.1-1 on page 567) $\Phi(x) := x\,(1 + (1 - d\,x))$ the number $x_1 := \Phi(x_0)$ is the
inverse of d modulo $p^2$. Modulo $p^2$ we know that $x_0\,d = (1 + k\,p)$ so we can write $x_0 = 1/d\,(1 + k\,p)$,
thereby

\[ \Phi(x_0) = \frac{1}{d}\,(1 + k\,p)\,(1 - k\,p) = \frac{1}{d}\,(1 - k^2 p^2) \equiv \frac{1}{d} \pmod{p^2} \tag{29.1-14} \]

The very same computation (with $x_1 = 1/d\,(1 + j\,p^2)$) shows that for $x_2 := \Phi(x_1)$ one has $x_2 \equiv 1/d \bmod p^4$.
Each application of $\Phi$ doubles the exponent of the modulus.

The equivalent scheme works for root extraction. We give an example for the inverse square root. With
$p = 17$ and $x_0 = 3$ we have $x_0^2 \cdot 2 \equiv 1 \bmod p$. That is, $x_0$ is the inverse square root of 2 modulo p ($x_0 \equiv 1/\sqrt{d} \bmod p$
where $d = 2$). Now use the iteration $\Phi(x) := x\,(1 + (1 - d\,x^2)/2)$ to compute $x_1 = \Phi(x_0) = -45/2 \equiv
122 \bmod p^2$ and observe that $x_1^2\,d \equiv 1 \bmod p^2$. Compute $x_2 = \Phi(x_1) = -1815665 \equiv 21797 \bmod p^4$ and
check that $x_2^2\,d \equiv 1 \bmod p^4$. After k steps we have $x_k \equiv 1/\sqrt{d} \bmod p^{2^k}$.
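The example can be checked with plain integer arithmetic. The sketch below is an illustration only: the function name `lift_inverse_sqrt` is made up, the modulus is assumed to be odd and below 2^32 so that all products fit into 64 bits, and the division by 2 is done by multiplying with (m+1)/2, the inverse of 2 modulo m.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t u64;

// One step Phi(x) = x*(1 + (1 - d*x^2)/2) modulo m.
// If x is an inverse square root of d modulo p^j and m = p^(2j) (p odd),
// then the result is an inverse square root of d modulo m.
u64 lift_inverse_sqrt(u64 x, u64 d, u64 m)
{
    assert( (m & 1) == 1 );    // 2 must be invertible modulo m
    assert( (m >> 32) == 0 );  // products below must fit into 64 bits
    x %= m;
    const u64 half = (m + 1) / 2;                        // 1/2 modulo m
    const u64 dx2 = (d % m) * (x * x % m) % m;           // d*x^2
    const u64 y = (1 + m - dx2) % m;                     // 1 - d*x^2
    const u64 c = (1 + half * y) % m;                    // 1 + (1 - d*x^2)/2
    return x * c % m;
}
```

Calling it with x = 3, d = 2 and m = 17^2 = 289 gives 122, and a second call with x = 122 and m = 17^4 = 83521 gives 21797, the values found above.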


The arithmetic is very similar to the arithmetic of power series. Note that GP allows doing such
computations as follows:

? 1/sqrt(2+O(17^5))
3 + 7*17 + 7*17^2 + 4*17^3 + 11*17^4 + O(17^5)
\\ Note that 21797 = 3 + 7*17 + 7*17^2 + 4*17^3
\\ and 122 = 3 + 7*17

Section 1.21 on page 56 describes the case p = 2. The computation of a square root modulo a power of p,
given a square root modulo p, is described in section 39.9.2 on page 785.


29.2 Root extraction for rationals 
We give expressions for the extraction of the a-th root of a rational quantity. 
29.2.1 Extraction of the square root 


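A general formula for a k-th order ($k \geq 2$) iteration for $\sqrt{d}$ is

\[ \Phi_k(x) = \sqrt{d}\; \frac{(x+\sqrt{d})^k + (x-\sqrt{d})^k}{(x+\sqrt{d})^k - (x-\sqrt{d})^k} = \sqrt{d}\; \frac{(p+q\sqrt{d})^k + (p-q\sqrt{d})^k}{(p+q\sqrt{d})^k - (p-q\sqrt{d})^k} \tag{29.2-1} \]

where $x = p/q$. All $\sqrt{d}$ vanish when expanded:

\begin{align}
\Phi_2(x) &= \frac{x^2 + d}{2\,x} = \frac{p^2 + d\,q^2}{2\,p\,q} \tag{29.2-2a} \\
\Phi_3(x) &= x\,\frac{x^2 + 3\,d}{3\,x^2 + d} = \frac{p}{q}\,\frac{p^2 + 3\,d\,q^2}{3\,p^2 + d\,q^2} \tag{29.2-2b} \\
\Phi_4(x) &= \frac{x^4 + 6\,d\,x^2 + d^2}{4\,x^3 + 4\,d\,x} = \frac{p^4 + 6\,d\,p^2 q^2 + d^2 q^4}{4\,p^3 q + 4\,d\,p\,q^3} \tag{29.2-2c} \\
\Phi_5(x) &= x\,\frac{x^4 + 10\,d\,x^2 + 5\,d^2}{5\,x^4 + 10\,d\,x^2 + d^2} = \frac{p}{q}\,\frac{p^4 + 10\,d\,p^2 q^2 + 5\,d^2 q^4}{5\,p^4 + 10\,d\,p^2 q^2 + d^2 q^4} \tag{29.2-2d} \\
\Phi_k(x) &= \frac{\sum_{j=0}^{\lfloor k/2 \rfloor} \binom{k}{2j}\, x^{k-2j}\, d^j}{\sum_{j=0}^{\lfloor (k-1)/2 \rfloor} \binom{k}{2j+1}\, x^{k-2j-1}\, d^j} \tag{29.2-2e}
\end{align}

The denominators and numerators of $\Phi_k$ are terms of the second order recurrence

\[ a_k = 2\,x\,a_{k-1} - (x^2 - d)\,a_{k-2} \tag{29.2-3} \]

with initial terms $a_0 = 1$, $a_1 = x$ for the numerators and $a_0 = 0$, $a_1 = 1$ for the denominators (that is,
$\Phi_0 = 1/0$, $\Phi_1 = x/1$). An equivalent form of relation 29.2-1 is

\[ \Phi_k(x) = \sqrt{-d}\, \cot\!\left(k\, \operatorname{arccot} \frac{x}{\sqrt{-d}}\right) \tag{29.2-4} \]

Setting $d = -1$ and $x = \cot(z)$ we find

\[ \cot(k\,z) = \Phi_k(\cot(z)) \tag{29.2-5} \]

From this relation we deduce the following composition law:

\[ \Phi_m(\Phi_n(x)) = \Phi_{m n}(x) \tag{29.2-6} \]

There is a nice expression for the error behavior of the k-th order iteration:

\[ \Phi_k\!\left(\sqrt{d} \cdot \frac{1+e}{1-e}\right) = \sqrt{d} \cdot \frac{1+e^k}{1-e^k} \tag{29.2-7} \]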
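For rational start values the iterations stay within the rationals, so they can be carried out with integers only. The sketch below applies the third order iteration in its p/q form; the names `Fraction` and `phi3_sqrt` are assumptions of this example, and 64-bit integers are used, so only a few steps are possible before overflow.

```cpp
// A fraction p/q with unsigned 64-bit numerator and denominator.
struct Fraction
{
    unsigned long long p, q;
};

// Third order iteration for sqrt(d) on fractions:
//   p/q  ->  p*(p^2 + 3*d*q^2) / (q*(3*p^2 + d*q^2))
// The result is not reduced to lowest terms.
Fraction phi3_sqrt(Fraction x, unsigned long long d)
{
    const unsigned long long p2 = x.p * x.p;
    const unsigned long long q2 = x.q * x.q;
    return { x.p * (p2 + 3*d*q2), x.q * (3*p2 + d*q2) };
}
```

Starting with 1/1 and d = 2, one step gives 7/5 and a second step gives 1393/985. By the composition law the second value equals the ninth order iterate of 1/1, and indeed 2·985² − 1393² = 1.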
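29.2.2 Extraction of the r-th root

A second order iteration for $\sqrt[r]{d}$ is given by

\[ \Phi_2(x) = x + \frac{d - x^r}{r\,x^{r-1}} = \frac{(r-1)\,x^r + d}{r\,x^{r-1}} = \frac{1}{r}\left((r-1)\,x + \frac{d}{x^{r-1}}\right) \tag{29.2-8} \]

A third order iteration for $\sqrt[r]{d}$ is

\[ \Phi_3(x) = x \cdot \frac{\alpha\,x^r + \beta\,d}{\beta\,x^r + \alpha\,d} = \frac{p}{q} \cdot \frac{\alpha\,p^r + \beta\,q^r d}{\beta\,p^r + \alpha\,q^r d} \tag{29.2-9} \]

where $x = p/q$, $\alpha = r - 1$ and $\beta = r + 1$. An alternative form is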


d PBa+ad 


where again $\alpha = r - 1$ and $\beta = r + 1$.


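29.2.3 Rational iterations for roots ‡

Rational iterations can also be obtained using Padé approximants.

29.2.3.1 Square root

Let $P_{[i,j]}(z)$ be the approximant of $\sqrt{z}$ around $z = 1$ of order $[i, j]$. An iteration of order $i + j + 1$ is given
by $x\,P_{[i,j]}(d/x^2)$. Different combinations of i and j result in alternative iterations:

\begin{align}
[i,j] &\mapsto x\,P_{[i,j]}\!\left(\frac{d}{x^2}\right) \tag{29.2-11a} \\
[1,0] &\mapsto \frac{x^2 + d}{2\,x} \tag{29.2-11b} \\
[0,1] &\mapsto \frac{2\,x^3}{3\,x^2 - d} \tag{29.2-11c} \\
[1,1] &\mapsto x\,\frac{x^2 + 3\,d}{3\,x^2 + d} \tag{29.2-11d} \\
[2,0] &\mapsto \frac{3\,x^4 + 6\,d\,x^2 - d^2}{8\,x^3} \tag{29.2-11e} \\
[0,2] &\mapsto \frac{8\,x^5}{15\,x^4 - 10\,d\,x^2 + 3\,d^2} \tag{29.2-11f}
\end{align}

Still other forms are obtained by using $\frac{d}{x}\,P_{[i,j]}\!\left(\frac{x^2}{d}\right)$:

\begin{align}
[i,j] &\mapsto \frac{d}{x}\,P_{[i,j]}\!\left(\frac{x^2}{d}\right) \tag{29.2-12a} \\
[1,0] &\mapsto \frac{x^2 + d}{2\,x} \tag{29.2-12b} \\
[0,1] &\mapsto \frac{2\,d^2}{x\,(3\,d - x^2)} \tag{29.2-12c} \\
[1,1] &\mapsto \frac{d\,(d + 3\,x^2)}{x\,(3\,d + x^2)} \tag{29.2-12d} \\
[2,0] &\mapsto \frac{3\,d^2 + 6\,d\,x^2 - x^4}{8\,d\,x} \tag{29.2-12e} \\
[0,2] &\mapsto \frac{8\,d^3}{x\,(3\,x^4 - 10\,d\,x^2 + 15\,d^2)} \tag{29.2-12f}
\end{align}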
29.2.3.2 r-th root

The Padé approximants for the r-th root can be expressed as ratios of hypergeometric series (using
relation 36.2-9 on page 689):

\[ F\!\left(\left.\begin{matrix} u,\; v \\ u+v+1/r \end{matrix}\;\right|\; z\right) \Big/\; F\!\left(\left.\begin{matrix} u+1/r,\; v+1/r \\ u+v+1/r \end{matrix}\;\right|\; z\right) = (1-z)^{1/r} \tag{29.2-13} \]

The expression on the left gives the Padé approximant $[i, j]$ if we set $u = -i$ and $v = -j - 1/r$ (so both series
terminate). The iteration $\Phi_{i,j} = x\,P_{[i,j]}(d/x^r)$ has order $i + j + 1$. We compute the third order iteration
for the fourth root (relation 29.2-9 with r = 4):

? \r hypergeom.gpi \\ definition of hypergeom()
? r=4; \\ r-th root
? i=1; \\ degree of denominator
? j=1; \\ degree of numerator
? u=-i;v=-j-1/r; \\ setup parameters so that series terminate
? N=hypergeom([u,v],[u+v+1/r],x,i)
-5/8*x + 1
? D=hypergeom([u+1/r,v+1/r],[u+v+1/r],x,j)
-3/8*x + 1
? t=N/D \\ Pade approximant [i,j]
(-5*x + 8)/(-3*x + 8)
\\ check t == (1-x)^(1/r) + order (i+j+1):
? n=i+j+2;
? t-hypergeom([-1/r],[],x,n)+O(x^(n))
5/256*x^3 + O(x^4)
? t=subst(t,x,1-x) \\ Pade approximant in x
(5*x + 3)/(3*x + 5)
? it=x*subst(t,x,d/x^r) \\ iteration for f(x)=x^r-d
(3*x^5 + 5*d*x)/(5*x^4 + 3*d)
\\ == x * (3*x^4 + 5*d)/(5*x^4 + 3*d)

Now we check the order of the iteration. We set $d = 1$ and compute $f(\Phi(d^{1/r}\,(1+e))) = f(\Phi(1+e))$:

? f(x)=x^r-1; \\ f(x)=x^r-d for d==1
? it=subst(it,d,1); \\ d==1
? er=subst(it,x,(1+e)); \\ Phi( d^(1/r)*(1+e) ) = Phi(1+e)
? taylor(f(er),e) \\ f( Phi(1+e) ) =?= O(e^(i+j+1))
5*e^3 - 15/2*e^4 + [...] \\ OK


An alternative expression for iterations of order i + j +1 is ®j j = d Pis (=). The approximant P. 
leads to the third order iteration given as relation [29.2-10 


In section 30.5.2 on page 595 Padé approximants are used to find iterations for arbitrary functions f.
29.3 Divisionless iterations for the inverse a-th root 


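There is a nice general formula that gives iterations with arbitrary order of convergence for $1/\sqrt[a]{d} = d^{-1/a}$
that involve no long division. We use the identity

\[ d^{-1/a} = x\,\left(1 - (1 - x^a d)\right)^{-1/a} = x\,(1 - y)^{-1/a} \quad \text{where } y := (1 - x^a d) \tag{29.3-1} \]

Expansion as a series in y gives

\[ d^{-1/a} = x \sum_{k=0}^{\infty} (1/a)^{\overline{k}}\; \frac{y^k}{k!} \tag{29.3-2} \]

where $z^{\overline{k}} := z\,(z+1)\,(z+2) \cdots (z+k-1)$ (and $z^{\overline{0}} := 1$) is the rising factorial power, written out:

\begin{align}
d^{-1/a} = x \Bigg(1 &+ \frac{y}{a} + \frac{(1+a)\,y^2}{2\,a^2} + \frac{(1+a)(1+2a)\,y^3}{6\,a^3} \\
&+ \frac{(1+a)(1+2a)(1+3a)\,y^4}{24\,a^4} + \ldots + \frac{\prod_{k=1}^{n-1}(1+k\,a)}{n!\,a^n}\; y^n + \ldots \Bigg) \tag{29.3-3}
\end{align}

An n-th order iteration for $d^{-1/a}$ is obtained by truncating the above series after the (n − 1)-th term:

\[ x_{k+1} = \Phi_n(x_k) \quad \text{where} \quad \Phi_n(x) := x \sum_{k=0}^{n-1} (1/a)^{\overline{k}}\; \frac{y^k}{k!} \tag{29.3-4} \]

Convergence is n-th order:

\[ \Phi_n\!\left(d^{-1/a}\,(1+e)\right) = d^{-1/a}\,(1 + O(e^n)) \tag{29.3-5} \]

For example, the second order iteration is

\[ \Phi_2(x) := x + x\,\frac{(1 - d\,x^a)}{a} \tag{29.3-6} \]

Convergence is indeed quadratic: if $x = \frac{1}{\sqrt[a]{d}}(1+e)$, then

\[ \Phi_2(x) = \frac{1}{\sqrt[a]{d}}\,(1+e)\left(1 + \frac{1 - (1+e)^a}{a}\right) = \frac{1}{\sqrt[a]{d}} \left(1 - \frac{a+1}{2}\,e^2 + O(e^3)\right) \tag{29.3-7} \]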
29.3.1 Iterations for the inverse 
Set $a = 1$, $y = 1 - d\,x$ to compute the inverse of d.

\begin{align}
\frac{1}{d} &= x\,\frac{1}{1 - y} \tag{29.3-8a} \\
\Phi_k(x) &= x \left(1 + y + y^2 + y^3 + y^4 + \ldots + y^{k-1}\right) \tag{29.3-8b}
\end{align}

For example, $\Phi_2(x) = x\,(1 + y)$ is the second order iteration 29.1-1 on page 567.

Composition is particularly simple with the iterations for the inverse:

\[ \Phi_{n m}(x) = \Phi_n(\Phi_m(x)) \tag{29.3-9} \]

There are simple closed forms for this iteration:

\begin{align}
\Phi_k &= \frac{1 - y^k}{d} = x\,\frac{1 - y^k}{1 - y} \tag{29.3-10a} \\
\Phi_\infty &= x \left(1 + y + y^2 + y^3 + y^4 + \ldots\right) \tag{29.3-10b} \\
&= x\,(1+y)\,(1+y^2)\,(1+y^4)\,(1+y^8) \cdots \tag{29.3-10c} \\
&= x\,(1+y+y^2)\,(1+y^3+y^6)\,(1+y^9+y^{18}) \cdots \tag{29.3-10d}
\end{align}

The expression for the convergence of the k-th order iteration is

\[ \Phi_k\!\left(\frac{1}{d}\,(1+e)\right) = \frac{1}{d} \left(1 - (-e)^k\right) \tag{29.3-11} \]

The iteration converges if $|e| < 1$ for the start value $x_0 = \frac{1}{d}(1+e)$. That is, the region of attraction is
the open disc of radius $r = 1/d$ around the point $1/d$, independent of the order k. For other iterations,
the region of attraction usually has a fractal boundary and further depends on the order.


29.3.2 Iterations for the inverse square root 


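Set $a = 2$, $y = 1 - d\,x^2$ to compute the inverse square root of d.

\begin{align}
\frac{1}{\sqrt{d}} &= x\,\frac{1}{\sqrt{1 - y}} \tag{29.3-12a} \\
&= x \left(1 + \frac{y}{2} + \frac{3\,y^2}{8} + \frac{5\,y^3}{16} + \frac{35\,y^4}{128} + \ldots + \frac{\binom{2k}{k}\,y^k}{4^k} + \ldots\right) \tag{29.3-12b} \\
\Phi_k(x) &= x \left(1 + \frac{y}{2} + \frac{3\,y^2}{8} + \ldots + \frac{\binom{2(k-1)}{k-1}\,y^{k-1}}{4^{k-1}}\right) \tag{29.3-12c}
\end{align}

$\Phi_2(x) = x\,(1 + y/2)$ is the second order iteration 29.1-3 on page 568.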
29.3.3 Computation of the a-th root 


? default(realprecision,55);
? n=5; d=3;
? f=x^n-d;
? phi(x)=(x+(x/n-x^(n+1)/(n*d)));
? phi(x)
-1/15*x^6 + 6/5*x
? y=real(polroots(f)[1])
1.245730939615517325966680336640305080939309993068779811
? y*=(1.01); \\ <--= initial approximation within 1%
? for(k=0,7, t=phi(y); print(k,": ",y); y=t; );
0: 1.258188249011672499226347140006708131748703092999467609
1: 1.245352199888209161292281504236361352521387343922682049
2: 1.245730594310665132338126760084832140850833621880667303
3: 1.245730939615230180340553343162793425783500280138356272
4: 1.245730939615517325966680138075892858940624403124874962
5: 1.245730939615517325966680336640305080939309993068684860
6: 1.245730939615517325966680336640305080939309993068779811
7: 1.245730939615517325966680336640305080939309993068779811

Figure 29.3-A: Quantities occurring in the iterative computation of $\sqrt[5]{3}$.

The following (second order) iteration computes $\sqrt[a]{d}$ directly:

\[ \Phi(x) = x \cdot \left(1 + \frac{d - x^a}{a\,d}\right) \tag{29.3-13} \]

Figure 29.3-A shows the quantities occurring in the iterative computation of $\sqrt[5]{3}$. The iteration involves
no long division for small (rational) d.
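The GP session of the figure can be mirrored with machine floats. This is only a sketch: `root_iteration` is a made-up name, and `std::pow` stands in for the powering that a multi-precision implementation would do by repeated multiplication.

```cpp
#include <cmath>

// Second order iteration for the a-th root of d:
//   Phi(x) = x*(1 + (d - x^a)/(a*d))
// The only division is by the small quantity a*d.
double root_iteration(double d, int a, double x0, int steps)
{
    double x = x0;
    for (int k=0; k<steps; ++k)
    {
        x += x*(d - std::pow(x, a))/(a*d);
    }
    return x;
}
```

For d = 3, a = 5 and the start value 1.258188249011672 (one percent above the root) the first step gives 1.2453521998..., matching the second line of the output in the figure.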


To compute the a-th root of a full-precision number d, we can use the iteration for the inverse root and
invert afterwards. Another possibility is to compute the inverse a-th root of $d^{a-1}$ and multiply with d
afterwards:

\[ \left[(d^{a-1})^{-1/a}\right] d = d^{(1-a)/a}\; d = d^{1/a} \tag{29.3-14} \]

If a is small the cost is lower than with the final iteration for the inverse (which costs about three
multiplications or nine FFTs). The powering-method is not more expensive than inversion if the rightmost
column in figure 28.5-A on page 566 for $e = a - 1 < 6$. If the iteration for the inverse involves a loss of
precision, the method might be preferred even if its cost is higher.


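29.3.4 Error expressions for inverse square root iterations ‡

An expression for the error behavior of the n-th order iteration similar to relation 29.2-7 on page 570 is

\begin{align}
E_n &:= \Phi_n\!\left(\frac{1+e}{1-e}\right) \quad \text{where } d = 1 \tag{29.3-15a} \\
&= \frac{\sum_{k=0}^{n-1} \binom{2n-1}{k} (-e)^k \;-\; \sum_{k=n}^{2n-1} \binom{2n-1}{k} (-e)^k}{(1-e)^{2n-1}} \tag{29.3-15b}
\end{align}

Now define $c := -\sum_{k=n}^{2n-1} \binom{2n-1}{k} (-e)^k \Big/ \sum_{k=0}^{n-1} \binom{2n-1}{k} (-e)^k$, then

\[ E_n = \frac{1+c}{1-c} \tag{29.3-15c} \]

For example, with n = 2 we have

\begin{align}
E_2 &:= \Phi_2\!\left(\frac{1+e}{1-e}\right) \quad \text{where } d = 1 \tag{29.3-16a} \\
&= \frac{1 - 3e - 3e^2 + e^3}{1 - 3e + 3e^2 - e^3} \tag{29.3-16b} \\
&= 1 - 6e^2 - 16e^3 - 30e^4 - 48e^5 - 70e^6 - \ldots \tag{29.3-16c} \\
&= -1 + \frac{6}{(e-1)^2} + \frac{4}{(e-1)^3} \tag{29.3-16d} \\
E_2 &= \frac{1+c}{1-c} \quad \text{where } c = e^2\, \frac{e - 3}{1 - 3e} \tag{29.3-16e}
\end{align}

For n = 4 we have

\begin{align}
E_4 &:= \Phi_4\!\left(\frac{1+e}{1-e}\right) \quad \text{where } d = 1 \tag{29.3-17a} \\
&= \frac{1 - 7e + 21e^2 - 35e^3 - 35e^4 + 21e^5 - 7e^6 + e^7}{1 - 7e + 21e^2 - 35e^3 + 35e^4 - 21e^5 + 7e^6 - e^7} \tag{29.3-17b} \\
&= 1 - 70e^4 - 448e^5 - 1680e^6 - 4800e^7 - 11550e^8 - \ldots \tag{29.3-17c} \\
&= -1 + \frac{70}{(e-1)^4} + \frac{168}{(e-1)^5} + \frac{140}{(e-1)^6} + \frac{40}{(e-1)^7} \tag{29.3-17d} \\
E_4 &= \frac{1+c}{1-c} \quad \text{where } c = e^4\, \frac{e^3 - 7e^2 + 21e - 35}{1 - 7e + 21e^2 - 35e^3} \tag{29.3-17e}
\end{align}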
Two curious formulas related to the error behavior of ®2 are 
1 1 1 1 1 
6, ( — = a LT eae 29.3-18 
Cal] - val alts) anaas 
1 21 1 1 2.1 
Del — le- 2- ES =e? 29.3-19 
(Zale-32]) 7 vts i2) anam 


29.4 Initial approximations for iterations 


With the iterative schemes we always need an initial approximation for the value to be computed. Assume
we want to compute $f(d)$, for example, $f(d) = \sqrt{d}$ or $f(d) = \exp(d)$. We could convert the high precision
number d to a machine floating-point number and use the floating-point unit (FPU) to compute an
initial approximation. However, when d cannot be represented with a machine float, the method fails.
The method will also fail if the result causes an overflow, which is likely to happen with $f(d) = \exp(d)$.
The methods given here avoid this problem.


29.4.1 Inverse roots 
With $f(d) = d^{1/a}$ use the following technique. Write d in the form

\[ d = M \cdot R^X \tag{29.4-1} \]

where M is the mantissa, R the radix, and X the exponent. We have $0 \leq M < 1$ and $X \in \mathbb{Z}$. Now use

\[ d^{1/a} = M^{1/a} \cdot R^{X/a} = M^{1/a} \cdot R^{Y/a} \cdot R^{Z} \tag{29.4-2} \]

where $Z = \lfloor X/a \rfloor$ and $Y = X - a \cdot Z$ (so $X = a \cdot Z + Y$). Compute the three quantities on the
right side of relation 29.4-2 separately and finally the product as result. An implementation is [hfloat:
src/hf/itiroot.cc]:

void
approx_invpow(const hfloat &d, hfloat &c, long a)
{
    double dd;
    dt_mantissa_to_double(*(d.data()), dd);
    dd = pow(dd, 1.0/(double)a);  // M^(1/a)

    long Z = d.exp() / a;    // Z = X/a
    long Y = d.exp() - a*Z;  // Y = X%a

    double tt = pow((double)d.radix(), (double)Y/a);  // R^(Y/a)
    dd *= tt;  // M^(1/a) * R^(Y/a)

    d2hfloat(dd, c);     // c = M^(1/a) * R^(Y/a)
    c.exp( c.exp()+Z );  // c *= R^(Z)
}


We could also subtract $a \cdot Z$ from the exponent before the iteration and add Z to the exponent afterwards:
$(R^{X - a Z})^{1/a} = R^{Y/a} = R^{X/a}\,R^{-Z}$. In that case the initial approximation can be computed via the
straightforward approach.
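The splitting of the exponent can be shown without the hfloat type. In the sketch below the pair (mantissa, exponent) is passed explicitly; the struct `MantExp` and the function `approx_root` are hypothetical names for this illustration and do not exist in hfloat.

```cpp
#include <cmath>

// A number represented as mant * R^exp (R is passed separately).
struct MantExp
{
    double mant;
    long exp;
};

// Approximation of (M * R^X)^(1/a) that never forms R^X as a double.
MantExp approx_root(double M, long X, double R, long a)
{
    const long Z = X / a;        // integer part of X/a, goes into the exponent
    const long Y = X - a*Z;      // remainder, so that X = a*Z + Y
    const double m = std::pow(M, 1.0/(double)a)         // M^(1/a)
                   * std::pow(R, (double)Y/(double)a);  // R^(Y/a)
    return { m, Z };
}
```

For M = 0.5, X = 7, R = 10000 and a = 2 the result is 70.71... · 10000³. The call with X = 1000000 works just as well, although 10000^1000000 is far outside the range of a double.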


29.4.2 Exponential function 
With $f(d) = \exp(d)$ write

\[ \exp(d) = M \cdot R^X \tag{29.4-3} \]

where $X = \lfloor d/\log(R) \rfloor$ and $M = \exp(d - X \cdot \log R)$. The argument d must fit into a machine float, which
is not a restriction: for values d that are too big for a machine float, exp(d) will not fit into an hfloat
type (the exponent of the result would overflow). Compute the initial approximation to exp(d) as follows
[hfloat: src/tz/itexp.cc]:

void
approx_exp(const hfloat &d, hfloat &c)
{
    double dd;
    hfloat2d(d, dd);

    double lr = log( hfloat::radix() );

    double X = floor( dd/lr );
    double M = exp( dd-X*lr );

    d2hfloat(M, c);
    c.exp( c.exp()+(long)X );
}


An iteration for the computation of the exponential function is given in section 32.2 on page 627.
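The same idea as in the listing for the exponential function can be tried without hfloat. In this sketch the result is returned as a pair (mantissa, exponent); the names `MantExp` and `approx_exp_pair` are assumptions of the example.

```cpp
#include <cmath>

// A number represented as mant * R^exp (R is passed separately).
struct MantExp
{
    double mant;
    long exp;
};

// Approximation of exp(d) as M * R^X without overflow of the double range.
MantExp approx_exp_pair(double d, double R)
{
    const double lr = std::log(R);
    const double X = std::floor(d/lr);      // exponent with respect to radix R
    const double M = std::exp(d - X*lr);    // 1 <= M < R up to rounding
    return { M, (long)X };
}
```

For d = 100000 and R = 10000 a direct call of `std::exp` would overflow, while the pair has exponent 10857 and a mantissa between 1 and R.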
29.5 Some applications of the matrix square root 


We give applications of the iteration for the (inverse) square root to compute re-orthogonalized matrices, 
the polar decomposition, the sign decomposition, and the pseudo-inverse of a matrix. 


29.5.1 Re-orthogonalization 


A task from graphics applications: a rotation matrix A that deviates from being orthogonal (for example,
due to cumulative errors resulting from many multiplications with rotation matrices) shall be transformed
to the closest orthogonal matrix E. We have (see [295]):

\[ E = A\,(A^T A)^{-1/2} \tag{29.5-1} \]

With the divisionless iteration for the inverse square root

\[ \Phi(x) = x \left(1 + \frac{1}{2}\,(1 - d\,x^2) + \frac{3}{8}\,(1 - d\,x^2)^2 + \frac{5}{16}\,(1 - d\,x^2)^3 + \ldots\right) \tag{29.5-2} \]

the given task is easy: as $A^T A$ is close to unity (the identity matrix) we can use the (second order)
iteration with $d = A^T A$ and $x = 1$

\[ (A^T A)^{-1/2} \approx \left(1 + \frac{1 - A^T A}{2}\right) \tag{29.5-3} \]

and multiply by A to get a ‘closer-to-orthogonal’ matrix $A_+$:

\[ A_+ = A \left(1 + \frac{1 - A^T A}{2}\right) \approx E \tag{29.5-4} \]

The step can be repeated with $A_+$ (or higher orders can be used) if necessary. Note that the iteration is
the one for the computation of the inverse square root of 1 (relation 29.1-3 on page 568 with d = 1):

\[ x_+ = x \left(1 + \frac{1 - x^2}{2}\right) \tag{29.5-5} \]

For scalars sufficiently close to 1 the iteration converges to 1. For matrices not too far from a matrix E
such that $E^T E = 1$ (that is, E is orthogonal) the iteration converges to E.


It is instructive to write things down in the singular value decomposition (SVD) representation

\[ A = U\,\Omega\,V^T \tag{29.5-6} \]

where U and V are orthogonal and $\Omega$ is a diagonal matrix with non-negative entries, see [367]. We note
that the SVD is not unique, for example, for the 1 x 1 matrix [-2] we have [-2] = [-1] [2] [1] = [1] [2] [-1].
The SVD is a decomposition of the action of the matrix as: rotation, element-wise stretching, rotation.
Now

\[ A^T A = (V\,\Omega\,U^T)\,(U\,\Omega\,V^T) = V\,\Omega^2\,V^T \tag{29.5-7} \]

Thus (using the equality $(V\,\Omega\,V^T)^n = V\,\Omega^n\,V^T$)

\[ (A^T A)^{-1/2} = \left((V\,\Omega\,U^T)\,(U\,\Omega\,V^T)\right)^{-1/2} = (V\,\Omega^2\,V^T)^{-1/2} = V\,\Omega^{-1}\,V^T \tag{29.5-8} \]

and we have

\[ A\,(A^T A)^{-1/2} = (U\,\Omega\,V^T)\,(V\,\Omega^{-1}\,V^T) = U\,V^T \tag{29.5-9} \]

that is, the ‘stretching part’ was removed.


A numerical example: for

\[ A = \begin{bmatrix} +1.0000000 & +1.0000000 & +0.7500000 \\ -0.5000000 & +1.5000000 & +1.0000000 \\ +0.7500000 & +0.5000000 & -1.0000000 \end{bmatrix} \tag{29.5-10} \]

we have

\[ E = \begin{bmatrix} +0.803114165 & +0.291073143 & +0.519888513 \\ -0.486897253 & +0.823533541 & +0.291073143 \\ +0.343422053 & +0.486897252 & -0.803114166 \end{bmatrix} \tag{29.5-11} \]

and $E\,E^T = 1$.
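The numerical example can be reproduced with a short program. The following is a sketch under an extra assumption that is not part of the text above: the matrix is first divided by its Frobenius norm, so that all singular values lie below 1 and the second order step converges monotonically. Scaling by a positive number does not change E. The names `Mat3`, `multiply`, `transpose` and `reorthogonalize` are made up for this illustration.

```cpp
#include <array>
#include <cmath>

typedef std::array<std::array<double,3>,3> Mat3;

Mat3 multiply(const Mat3 &x, const Mat3 &y)
{
    Mat3 r;
    for (int i=0; i<3; ++i)
        for (int j=0; j<3; ++j)
        {
            r[i][j] = 0.0;
            for (int k=0; k<3; ++k)  r[i][j] += x[i][k] * y[k][j];
        }
    return r;
}

Mat3 transpose(const Mat3 &x)
{
    Mat3 r;
    for (int i=0; i<3; ++i)
        for (int j=0; j<3; ++j)
            r[i][j] = x[j][i];
    return r;
}

// Repeated second order step  A <- A*(1 + (1 - A^T*A)/2).
Mat3 reorthogonalize(Mat3 a, int steps)
{
    // Assumption of this sketch: scale by the Frobenius norm first.
    double f = 0.0;
    for (const auto &row : a)  for (double v : row)  f += v*v;
    f = std::sqrt(f);
    for (auto &row : a)  for (double &v : row)  v /= f;

    for (int s=0; s<steps; ++s)
    {
        Mat3 y = multiply(transpose(a), a);   // A^T * A
        for (int i=0; i<3; ++i)
            for (int j=0; j<3; ++j)
                y[i][j] = (i==j ? 1.5 : 0.0) - 0.5*y[i][j];  // 1 + (1 - A^T*A)/2
        a = multiply(a, y);
    }
    return a;
}
```

Applied to the matrix A of the example, a few dozen steps are more than enough to reach the matrix E given above to machine precision.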
29.5.2 Polar decomposition

The polar decomposition of a matrix A is a representation of the form

\[ A = E\,R \tag{29.5-12} \]

where the matrix E is orthogonal and $R = R^T$. It is analogous to the representation of a complex number
$z \in \mathbb{C}$ as $z = e^{i\varphi}\,r$ (identify $R \sim r$ and $E \sim e^{i\varphi}$). The polar decomposition can be defined by

\[ A = E\,R := \left(A\,(A^T A)^{-1/2}\right) \left((A^T A)^{1/2}\right) \tag{29.5-13} \]

where $R = (A^T A)^{1/2}$ and $E = A\,(A^T A)^{-1/2}$. The matrix E is computed as before:

\[ E = A \cdot \left(1 + \frac{1 - A^T A}{2} + \frac{3}{8}\,(1 - A^T A)^2 + \ldots\right) \tag{29.5-14} \]

The matrix R equals $E^{-1} A = E^T A$, that is

\begin{align}
A = E\,R &= E\,(E^T A) \tag{29.5-15} \\
&= (U\,V^T)\,(V\,\Omega\,V^T) \tag{29.5-16}
\end{align}

Compute the polar decomposition as

\begin{align}
E_0 &= A \tag{29.5-17a} \\
Y_k &= \left(1 + \frac{1 - E_k^T E_k}{2}\right) \tag{29.5-17b} \\
E_{k+1} &= E_k\,Y_k \;\to\; E \tag{29.5-17c} \\
R_{k+1} &= E_{k+1}^T\,A \;\to\; R \tag{29.5-17d} \\
E_{k+1}\,R_{k+1} &\;\to\; A \tag{29.5-17e}
\end{align}

Higher orders can be added in the computation of $Y_k$. If you prefer $z = r\,e^{i\varphi}$ over $e^{i\varphi}\,r$, then iterate as
above but set $R' = A\,E^T$ so that

\begin{align}
A = R'\,E &= (A\,E^T)\,E \tag{29.5-18} \\
&= (U\,\Omega\,U^T)\,(U\,V^T) \tag{29.5-19}
\end{align}

Numerical example: for

\[ A = \begin{bmatrix} +1.00000 & +1.00000 & +0.75000 \\ -0.50000 & +1.50000 & +1.00000 \\ +0.75000 & +0.50000 & -1.00000 \end{bmatrix} \tag{29.5-20} \]

we have

\begin{align}
A &= E\,R \tag{29.5-21a} \\
&= \begin{bmatrix} +0.80311 & +0.29107 & +0.51988 \\ -0.48689 & +0.82353 & +0.29107 \\ +0.34342 & +0.48689 & -0.80311 \end{bmatrix} \begin{bmatrix} +1.30412 & +0.24447 & -0.22798 \\ +0.24447 & +1.76982 & +0.55494 \\ -0.22798 & +0.55494 & +1.48410 \end{bmatrix} \tag{29.5-21b} \\
A &= R'\,E \tag{29.5-21c} \\
&= \begin{bmatrix} +1.48410 & +0.55494 & +0.22798 \\ +0.55494 & +1.76982 & -0.24447 \\ +0.22798 & -0.24447 & +1.30412 \end{bmatrix} \begin{bmatrix} +0.80311 & +0.29107 & +0.51988 \\ -0.48689 & +0.82353 & +0.29107 \\ +0.34342 & +0.48689 & -0.80311 \end{bmatrix} \tag{29.5-21d}
\end{align}
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29.5.3 Sign decomposition 


The sign decomposition can be defined as 
A= SN = (aui) (a) (29.5-22) 


where N = (42)71/2 and S = A(A?)~1/?. The square root has to be chosen such that all its eigenvalues 
have positive real parts. The sign decomposition is undefined if A has eigenvalues on the imaginary axis. 
The matrix S is its own inverse (its eigenvalues are +1). The matrices A, S and N commute pair-wise: 
SN=NS, AN= NA and AS = SA. 


Use 
So = A (29.5-23a) 
Y = (1+ 5%) (29.5-23b) 
Skat SkY >S (29.5-23c) 
Nui = SrA >N (29.5-23d) 


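Numerical example: for
$$A = \begin{bmatrix} +1.00000 & +1.00000 & +0.75000 \\ -0.50000 & +1.50000 & +1.00000 \\ +0.75000 & +0.50000 & -1.00000 \end{bmatrix} \tag{29.5-24}$$
we have
$$A = S N \tag{29.5-25a}$$
$$\quad = \begin{bmatrix} +0.90071 & -0.01706 & +0.29453 \\ -0.24065 & +0.95862 & +0.71389 \\ +0.62679 & +0.10775 & -0.85933 \end{bmatrix} \begin{bmatrix} +1.13014 & +1.02237 & +0.36392 \\ -0.18454 & +1.55423 & +0.06423 \\ -0.07158 & +0.35875 & +1.43718 \end{bmatrix} \tag{29.5-25b}$$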


where S S = 1. See and also [181]. 


29.5.4 Pseudo-inverse 
While we are at it: define a matrix $A^+$ as
$$A^+ := (A A^T)^{-1} A^T = (V \Omega^{-2} V^T)(V \Omega U^T) = V \Omega^{-1} U^T \tag{29.5-26}$$
This looks suspiciously like the inverse of $A$. In fact, this is the pseudo-inverse of $A$:
$$A^+ A = (V \Omega^{-1} U^T)(U \Omega V^T) = 1 \quad \text{but wait} \tag{29.5-27}$$

$A^+$ has the nice property to exist even if $A^{-1}$ does not. If $A^{-1}$ exists, it is identical to $A^+$. If not, $A^+ A \neq 1$ but $A^+$ will give the best possible (in a least-square sense) solution $x^+ = A^+ b$ of the equation $A x = b$ (see [115, p.770]). To find $(A A^T)^{-1}$ use the iteration for the inverse:

$$\Phi(x) = x \left(1 + (1 - d x) + (1 - d x)^2 + \ldots\right) \tag{29.5-28}$$
with $d = A A^T$ and the start value $x_0 = 2^{-n} (A A^T) / \|A A^T\|^2$ where $n$ is the dimension of $A$.


A GP implementation of the pseudo-inverse using the SVD: 


matpseudoinv(A)=
\\ Return pseudo-inverse of A
{
    local(t, x, U, d, V);
    t = matSVD(A);
    U = t[1];  d = t[2];  V = t[3];
    for (k=1, matsize(d)[1],
        x = d[k,k];  if ( x>1e-15, d[k,k]=1/x, d[k,k]=0 );
    );
    return( V*d*U~ );
}

[Figure 29.5-A: GP session that prints the 3 x 4 matrix A, the factors U, d and V returned by matSVD(A), the pseudo-inverse Ax=matpseudoinv(A), and the two products A*Ax and Ax*A.]

Figure 29.5-A: Numerical example for the pseudo-inverse computed by the SVD. We use a 3 x 4 matrix which is definitely not invertible. A working precision of 25 decimal digits was used, so $A A^+ = 1$ to within that precision. On the other hand, $A^+ A$ is not close to the unit matrix.


The SVD is computed with the help of a routine (qfjacobi()) that returns the eigenvectors of a real symmetric matrix:


matSVDcore(A)=
\\ Singular value decomposition:
\\ Return [U, d, V] so that U*d*V~==A
\\ d is a diagonal matrix
\\ U, V are orthogonal
{
    local(U, d, V);  \\ returned quantities
    local(t, R, d1);
    R = conj(A~)*A;  \\ R==V*d^2*V~
    t = qfjacobi( R );  \\ fails with eigenvalues==zero
    V = t[2];
    d = real(sqrt(t[1]));
    d1 = d;
    for (k=1, length(d1), t=d1[k]; if (abs(t)>1e-16, t=1/t, t=0); d1[k]=t );
    d1 = matdiagonal(d1);
    d = matdiagonal(d);
    U = (A*V*d1);
    return( [U, d, V] );
}

The core routine is always called with a matrix A whose number of rows is greater than or equal to its number of columns:


matSVD(A)=
{
    local(tq, t, U, d, V);
    t = matsize(A);
    tq=0;  if ( t[1]<t[2], tq=1; A=A~; );
    t = matSVDcore(A);
    d = t[2];
    if ( tq,
        U=t[3];  V=t[1];
    , /* else */
        U=t[1];  V=t[3];
    );
    return( [U, d, V] );
}


For a numerical example see figure 29.5-A. The connection between the SVD of a matrix $A$ and the eigenvectors of $A^T A$ is described in [198].


29.6 Goldschmidt’s algorithm 


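A framework for the Goldschmidt algorithm can be stated as follows. Let $A$, $B$, $a$, and $b$ be integers. Initialize
$$x_0 = d^A, \quad E_0 = d^B \tag{29.6-1a}$$
then iterate
$$P_k := 1 + \frac{1 - E_k}{a} \tag{29.6-1b}$$
$$x_{k+1} := x_k P_k^b \to d^{A - B b/a} \tag{29.6-1c}$$
$$E_{k+1} := E_k P_k^a \to 1 \tag{29.6-1d}$$

The algorithm converges quadratically. The updates for $x$ and $E$ (last two relations) can be computed independently. The iteration is not self-correcting, so the computations have to be carried out with full precision throughout.

An invariant of the algorithm is given by $x_k^a / E_k^b$:
$$\frac{x_{k+1}^a}{E_{k+1}^b} = \frac{(x_k P_k^b)^a}{(E_k P_k^a)^b} = \frac{x_k^a}{E_k^b} \tag{29.6-2a}$$
We use the relation
$$\frac{x_k^a}{E_k^b} = \frac{x_0^a}{E_0^b} = \frac{d^{A a}}{d^{B b}} = d^{A a - B b} \tag{29.6-2b}$$
and, as $E$ converges to 1, we find that
$$\frac{x_\infty^a}{E_\infty^b} = x_\infty^a = \frac{x_0^a}{E_0^b} \tag{29.6-2c}$$
The quantity computed is $d^{A - B b/a}$:
$$x_\infty = \left(\frac{x_0^a}{E_0^b}\right)^{1/a} = \frac{x_0}{E_0^{b/a}} = \frac{d^A}{d^{B b/a}} = d^{A - B b/a} \tag{29.6-2d}$$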


We consider some interesting special cases in what follows and set $b = 1$.

29.6.1 Algorithm for the a-th root

Solving $A - B/a = 1/a$ gives $B = A a - 1$ and especially $A = 1$, $B = a - 1$. That is, set
$$x_0 = d, \quad E_0 = d^{a-1} \tag{29.6-3a}$$
$$P_k := 1 + \frac{1 - E_k}{a} \to 1 \tag{29.6-3b}$$
$$x_{k+1} := x_k \cdot P_k \tag{29.6-3c}$$
$$E_{k+1} := E_k \cdot P_k^a \to 1 \tag{29.6-3d}$$

Setting $a = 2$ gives an algorithm for the computation of the square root:
$$\sqrt{d} = d \prod_{k=0}^{\infty} \frac{3 - E_k}{2} \tag{29.6-4}$$
where $E_0 = d$, $E_{k+1} := E_k \left(\frac{3 - E_k}{2}\right)^2$.

An algorithm for the inverse a-th root is obtained by solving $A - B/a = -1/a$: $B = A a + 1$ and especially $A = 0$, $B = 1$. That is, set $x_0 = 1$ and $E_0 = d$, and iterate as in relations 29.6-3b to 29.6-3d until $x$ is close enough to $x_\infty = d^{-1/a}$.

Setting $a = 1$ gives an algorithm for the inverse ($P_k = 1 + (1 - E_k) = 2 - E_k$):
$$\frac{1}{d} = \prod_{k=0}^{\infty} (2 - E_k) \tag{29.6-5}$$
where $E_0 = d$, $E_{k+1} := E_k (2 - E_k)$.

Setting $a = 2$ gives an algorithm for the inverse square root ($P_k = 1 + (1 - E_k)/2 = (3 - E_k)/2$):
$$\frac{1}{\sqrt{d}} = \prod_{k=0}^{\infty} \frac{3 - E_k}{2} \tag{29.6-6}$$
where $E_0 = d$, $E_{k+1} := E_k \left(\frac{3 - E_k}{2}\right)^2$.
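The second order iterations of this section can be sketched in C++ with plain `double` arithmetic as follows. The function names are made up for this illustration. The sketch assumes that the argument `d` is already close enough to 1 for the iteration to converge (for example after a suitable scaling, which is not shown); it does not check this.

```cpp
#include <cmath>

// Shared loop of relations 29.6-3b to 29.6-3d:
// x picks up the factor P_k, E picks up the factor P_k^a.
static double goldschmidt_loop(double x, double E, int a, int steps) {
    for (int k = 0; k < steps; ++k) {
        const double P = 1.0 + (1.0 - E) / a;
        x *= P;
        E *= std::pow(P, a);
    }
    return x;
}

// a-th root: x_0 = d, E_0 = d^(a-1).
double goldschmidt_root(double d, int a, int steps) {
    return goldschmidt_loop(d, std::pow(d, a - 1), a, steps);
}

// inverse a-th root: x_0 = 1, E_0 = d.
double goldschmidt_inv_root(double d, int a, int steps) {
    return goldschmidt_loop(1.0, d, a, steps);
}
```

For instance, `goldschmidt_root(2.0, 2, 8)` approximates the square root of 2. Because the iteration is not self-correcting, rounding errors of the individual steps accumulate, so the last few bits of the result should not be trusted.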


29.6.2 Higher order algorithms for the inverse a-th root 


Higher order iterations are found by appending higher terms to the expression $\left(1 + \frac{1 - E_k}{a}\right)$ in the definition of $P_k$, as suggested by equation 29.3-3 on page 572 and the identification $y = 1 - E$:

$$E_{k+1} = E_k P_k^a \quad \text{where} \tag{29.6-7}$$
$$\begin{aligned} P_k = 1 &+ \frac{1 - E_k}{a} && \text{[second order]} \\ &+ \frac{(1+a)\,(1 - E_k)^2}{2 a^2} && \text{[third order]} \\ &+ \frac{(1+a)(1+2a)\,(1 - E_k)^3}{6 a^3} && \text{[fourth order]} \\ &+ \ldots \\ &+ \frac{(1+a)(1+2a)\cdots(1+(n-1)a)\,(1 - E_k)^n}{n!\, a^n} && \text{[order } (n+1)\text{]} \end{aligned} \tag{29.6-8}$$

x0 = 1
E0 = 2.0
P0 = 0.90625
b0 = 0.0

x1 = 0.90625
E1 = 1.3490314483642578125
P1 = 0.9317769741506936043151654303073883056640625
b1 = 1.5185

x2 = 0.844422882824066078910618671216070652008056640625
E2 = 1.01688061936626457410433320872039209492188542219092968
P2 = 0.9958243694256508418788315054034553239996718905629355821
b2 = 5.8884

x3 = 0.8408968848168658520212846605006387710597528718627830956
E3 = 1.00000223363323283559583877024921007574068879685671957
P3 = 0.9999994415924713406977321191709309975809003013470607162
b3 = 18.772

x4 = 0.8408964152537145441292683119973118637849080485731336497
E4 = 1.000000000000000005223677094319714797043731882484224637
P4 = 0.9999999999999999986940807264200713050026299021477654907
b4 = 57.409

x5 = 0.8408964152537145430311254762332148950400342623567845249
E5 = 1.000000000000000000000000000000000000000000000000000067
P5 = 0.9999999999999999999999999999999999999999999999999999833
b5 = 173.32

1/2^(1/4) = 0.8408964152537145430311254762332148950400342623567845108...

Figure 29.6-A: Numerical quantities occurring in the computation of $1/\sqrt[4]{2}$ using a third order Goldschmidt algorithm. The value $b_k$ gives the number of correct bits after step $k$.


For example, the inverse fourth root of $d = 2$ can be computed via the third order algorithm
$$x_0 = 1 \tag{29.6-9a}$$
$$E_0 = d = 2 \tag{29.6-9b}$$
$$E_{k+1} = E_k P_k^4 = E_k \left(\frac{45 - 18 E_k + 5 E_k^2}{32}\right)^4 \tag{29.6-9c}$$
$$x_{k+1} = x_k P_k \tag{29.6-9d}$$

Figure 29.6-A shows the numerical values of $x_k$, $E_k$ and $P_k$ up to step $k = 5$. The approximate precision in bits of $x_k$ is computed as $b_k = -\log(|1 - E_k|)/\log(2)$.


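29.7 Products for the a-th root ‡

Rewrite the well-known product form
$$\frac{1}{1-y} = (1+y)(1+y^2)(1+y^4)(1+y^8)\cdots \tag{29.7-1}$$
as
$$\frac{1}{1-y} = \prod_{k>0} (1 + Y_k) \quad \text{where} \quad Y_1 := y, \quad Y_{k+1} := Y_k^2 \tag{29.7-2}$$
We give product forms for a-th roots and their inverses that generalize the relations above.

29.7.1 Second order products

For the inverse square root use $1/\sqrt{1-y} = (1 + y/2) \cdot 1/\sqrt{1 - y^2 (3+y)/4}$, thereby
$$\frac{1}{\sqrt{1-y}} = \prod_{k>0} (1 + Y_k) \quad \text{where} \quad Y_1 := \frac{y}{2}, \quad Y_{k+1} := Y_k^2 \left(\frac{3}{2} + Y_k\right) \tag{29.7-3}$$
For the square root use $\sqrt{1-y} = (1 - y/2) \cdot \sqrt{1 - (y/(y-2))^2}$, so
$$\sqrt{1-y} = \prod_{k>0} (1 + Y_k) \quad \text{where} \quad Y_1 := -\frac{y}{2}, \quad Y_{k+1} := -\frac{1}{2} \left(\frac{Y_k}{1 + Y_k}\right)^2 \tag{29.7-4}$$
The relation for the inverse a-th root is
$$\frac{1}{\sqrt[a]{1-y}} = (1-y)^{-1/a} = \prod_{k>0} (1 + Y_k) \quad \text{where} \tag{29.7-5a}$$
$$Y_1 := \frac{y}{a}, \quad Y_{k+1} := \frac{1}{a} \left(1 - (1 - a Y_k)(Y_k + 1)^a\right) \tag{29.7-5b}$$
Alternatively,
$$\frac{1}{\sqrt[a]{d}} = x\,(1-y)^{-1/a} = x \prod_{k>0} (1 + Y_k) \tag{29.7-6}$$
with $y := 1 - d\,x^a$ and the definitions 29.7-5b for $Y_k$. For the a-th root we get
$$\sqrt[a]{1-y} = (1-y)^{1/a} = \prod_{k>0} (1 + Y_k) \quad \text{where} \tag{29.7-7a}$$
$$Y_1 := -\frac{y}{a}, \quad Y_{k+1} := \frac{1}{a}\, \frac{(1 + a Y_k) - (1 + Y_k)^a}{(1 + Y_k)^a} \tag{29.7-7b}$$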
29.7.2 Products of arbitrary order 
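We want to find an n-th order product for the inverse a-th root
$$\frac{1}{\sqrt[a]{1-y}} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \quad Y_1 := y, \quad Y_{k+1} := N(Y_k) \tag{29.7-8}$$
The functions $T$ and $N$ have to be determined. Set
$$[1 - Y_1]^{-1/a} = [1 + T(Y_1)]\,[1 + T(Y_2)]\,[1 + T(Y_3)] \cdots \tag{29.7-9a}$$
$$\quad = [1 + T(Y_1)]\,[1 - Y_2]^{-1/a} \tag{29.7-9b}$$

where $1 + T(Y_1)$ is the Taylor expansion
$$[1-y]^{-1/a} = 1 + \frac{y}{a} + \frac{(1+a)\,y^2}{2 a^2} + \frac{(1+a)(1+2a)\,y^3}{6 a^3} + \ldots + \frac{\prod_{k=1}^{n-1} (1 + k a)\; y^n}{n!\, a^n} + \ldots \tag{29.7-10}$$
up to order $n - 1$. The Taylor expansion of $Y_2$ starts with a term $\sim y^n$. Using $Y_{k+1} = N(Y_k)$, as suggested by the relation between $Y_2$ and $Y_1$, gives a product with n-th order convergence. For example, for a third order product for $1/\sqrt{1-y}$, set
$$T(Y) := \frac{1}{2} Y + \frac{3}{8} Y^2 \tag{29.7-11}$$
Now solve $(1 + T(Y_1))^2 (1 - y) = (1 - Y_2)$ for $Y_2$ to obtain
$$Y_2 = \frac{5 y^3}{8} + \frac{15 y^4}{64} + \frac{9 y^5}{64} =: N(y) \tag{29.7-12}$$

Then, finally, ($Y_1 := y$ and)
$$\frac{1}{\sqrt{1-y}} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \tag{29.7-13a}$$
$$T(Y) := \frac{1}{2} Y + \frac{3}{8} Y^2 \tag{29.7-13b}$$
$$Y_{k+1} = N(Y_k) := \frac{Y_k^3}{64} \left(40 + 15 Y_k + 9 Y_k^2\right) \tag{29.7-13c}$$

Replacing relation 29.7-13c by $Y_{k+1} = 1 - (1 + T(Y_k))^a (1 - Y_k)$ gives the general formula for the inverse a-th root. The second order products lead to expressions that are quite nice:

$$\frac{1}{\sqrt[a]{1-y}} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \quad T(y) := \frac{y}{a} \quad \text{and} \tag{29.7-14a}$$
$$Y_{k+1} = N(Y_k) := 1 - \left(1 + \frac{Y_k}{a}\right)^a (1 - Y_k) \tag{29.7-14b}$$
$$\frac{1}{\sqrt[a]{1+y}} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \quad T(y) := -\frac{y}{a} \quad \text{and} \tag{29.7-15a}$$
$$Y_{k+1} = N(Y_k) := \left(1 - \frac{Y_k}{a}\right)^a (1 + Y_k) - 1 \tag{29.7-15b}$$
$$\sqrt[a]{1-y} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \quad T(y) := -\frac{y}{a} \quad \text{and} \tag{29.7-16a}$$
$$Y_{k+1} = N(Y_k) := \frac{(a - Y_k)^a - (1 - Y_k)\, a^a}{(a - Y_k)^a} = 1 - \frac{(1 - Y_k)\, a^a}{(a - Y_k)^a} \tag{29.7-16b}$$
$$\sqrt[a]{1+y} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \quad T(y) := +\frac{y}{a} \quad \text{and} \tag{29.7-17a}$$
$$Y_{k+1} = N(Y_k) := \frac{(1 + Y_k)\, a^a - (a + Y_k)^a}{(a + Y_k)^a} = \frac{(1 + Y_k)\, a^a}{(a + Y_k)^a} - 1 \tag{29.7-17b}$$
The third order product for $1/\sqrt[a]{1-y}$ is
$$\frac{1}{\sqrt[a]{1-y}} = \prod_{k>0} \left(1 + T(Y_k)\right) \quad \text{where} \quad T(y) := \frac{y}{a} + \frac{(1+a)\,y^2}{2 a^2} \quad \text{and} \tag{29.7-18a}$$
$$Y_{k+1} = N(Y_k) := 1 - \left(1 + \frac{Y_k}{a} + \frac{(1+a)\,Y_k^2}{2 a^2}\right)^a (1 - Y_k) \tag{29.7-18b}$$
$$\quad = 1 - \left(1 + T(Y_k)\right)^a (1 - Y_k) \tag{29.7-18c}$$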

29.7.3 Third order product for the a-th root 


The third order iteration given as relation 29.2-9 on page 571 gives a simple product for $\sqrt[a]{d}$. Let
$$P_k := \prod_{j=0}^{k} Y_j \tag{29.7-19a}$$




default(realprecision, 55);
a=3; d=2; al=a-1; be=a+1;
F(x)=(al*x^a+be*d)/(be*x^a+al*d)  \\ == (x^3 + 4)/(2*x^3 + 2)
p=99.0;  \\ very bad approximation to the root
for (k=0,25,p*=F(p);print(" ",p););
 49.50015304544986086777285375657013294857260641038853963
 24.75068869632253579329539807676065903819005885296493460
 12.37779278012838259998730922838288956373772022504401295
 6.198681729467973308980394893535983190191783732016583057
 3.138216095652577516458713512717521269860309193768721790
 1.716643430397499239455514329236039539565820802853452486
 1.283323514322784332830377116401015264599592858280490032
 1.259926284571153279491359924753865826868163920866767105
 1.259921049894873225007750979366564220753732750396518986
 1.259921049894873164767210607278228350570251464701599790
 1.259921049894873164767210607278228350570251464701507980
 1.259921049894873164767210607278228350570251464701507980

Figure 29.7-A: Computation of $\sqrt[3]{2}$ with a very bad initial approximation.


where $Y_0$ is sufficiently near to $\sqrt[a]{d}$ and $Y_k = F(P_{k-1})$ where

$$F(x) := \frac{\alpha\, x^a + \beta\, d}{\beta\, x^a + \alpha\, d} \tag{29.7-19b}$$

with $\alpha = a - 1$ and $\beta = a + 1$. Then $P_\infty = \sqrt[a]{d}$. Figure 29.7-A shows the numerical quantities with the computation of $\sqrt[3]{2}$ with a starting value $Y_0 = 99$ that is not at all close to the root. We have $F(x) = \frac{x^3 + 4}{2 x^3 + 2}$ which is $\approx \frac{1}{2}$ for large values of $x$. Therefore the big initial values are repeatedly halved before the third order convergence begins.
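The product 29.7-19a can be mirrored in C++ along the lines of the GP session of figure 29.7-A. This is a double precision sketch only; the function name `root_by_product` and its parameters are chosen for the illustration.

```cpp
#include <cmath>

// Accumulate P_k = Y_0 * Y_1 * ... * Y_k with Y_0 = p0 and
// Y_k = F(P_{k-1}), F(x) = (al*x^a + be*d) / (be*x^a + al*d),
// where al = a-1 and be = a+1 (relation 29.7-19b).
double root_by_product(double d, int a, double p0, int steps) {
    const double al = a - 1.0;
    const double be = a + 1.0;
    double p = p0;
    for (int k = 0; k < steps; ++k) {
        const double xa = std::pow(p, a);
        p *= (al * xa + be * d) / (be * xa + al * d);
    }
    return p;
}
```

Calling `root_by_product(2.0, 3, 99.0, 26)` repeats the experiment of the figure: the first few factors are close to 1/2, after which the value settles at the cube root of 2.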


29.8  Divisionless iterations for polynomial roots 


Let $f(x)$ be a polynomial in $x$ with simple roots only, then
$$\Phi(x) := x - p(x)\, f(x) \quad \text{where} \quad p(x) := f'(x)^{-1} \bmod f(x) \tag{29.8-1}$$

is a second order iteration for the roots of $f(x)$. The iteration involves no long division if all coefficients are small rationals. Instead of dividing by $f'(x)$ a multiplication by the modular inverse $p(x)$ is used. As $\deg(p) < \deg(f)$ we have $\deg(\Phi) \leq 2 \deg(f) - 1$.


For example, for $f(x) = a x^2 + b x + c$ we have
$$\Phi(x) = x - \frac{2 a x + b}{\Delta}\, f(x) \quad \text{where} \quad \Delta = b^2 - 4 a c \tag{29.8-2}$$

The general expressions for polynomials of orders $> 2$ get complicated. However, for fixed polynomial coefficients the iteration is more manageable. For example, with $f(x) = x^3 + 5 x + 1$ we find
$$\Phi(x) = x - \frac{1}{527} \left(30 x^2 - 9 x + 100\right) f(x) \tag{29.8-3}$$
For the polynomial $x^n - d$ we have $p = x/(n d)$ and the iteration is (relation 29.3-13 on page 574):
$$\Phi(x) = x - \frac{x}{n d} \left(x^n - d\right) = x + \frac{x}{n} - \frac{x^{n+1}}{n d} \tag{29.8-4}$$

The construction is given in [185] where a method to construct divisionless iterations of arbitrary order is given: let $p f' + q f = 1$ and

$$\Phi_1 := x - p_1 f, \quad p_1 := p \tag{29.8-5a}$$
$$p_r := p\, p'_{r-1} - (r-1)\, q\, p_{r-1} \tag{29.8-5b}$$
$$\Phi_r := \Phi_{r-1} + (-1)^r\, p_r\, f^r / r! \tag{29.8-5c}$$

then $\Phi_r$ is an iteration of order $r + 1$.

Chapter 30

Iterations for the inversion of a function

We study some general expressions for iterations for the zero of a function. Two schemes for one-point iterations of arbitrary order are given: Householder's formula and Schröder's formula. Several methods to construct alternative iterations are described. Moreover, iterations that also converge for multiple roots and a technique to turn a linear iteration into a super-linear one are presented.


30.1 Iterations and their rate of convergence 


An iteration for a zero $r$ (or root, $f(r) = 0$) of a function $f(x)$ can be given as a function $\Phi(x)$ that, when used like

$$x_{k+1} = \Phi(x_k) \tag{30.1-1}$$

will make $x_k$ converge towards the root: $x_\infty = r$. Convergence is subject to the condition that $x_0$ is close enough to $r$. The function $\Phi(x)$ must (and can) be constructed such that it has an attracting fixed point where $f(x)$ has a zero:

$$\Phi(r) = r \quad \text{(fixed point)} \tag{30.1-2}$$
$$|\Phi'(r)| < 1 \quad \text{(attracting)} \tag{30.1-3}$$

This type of iteration is called a one-point iteration. There are also multi-point iterations, these are of the form $x_{k+1} = \Phi(x_k, x_{k-1}, \ldots, x_{k-j})$, $j \geq 1$. An example is the two-point iteration known as the secant method

$$x_{k+1} = \Phi(x_k, x_{k-1}) = x_k - f(x_k)\, \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})} \tag{30.1-4}$$
We are mainly concerned with one-point iterations in what follows. 


The order of convergence (or simply order) of a given iteration can be defined as follows: let $x = r \cdot (1 + e)$ with $|e| \ll 1$ and $\Phi(x) = r \cdot (1 + \alpha e^n + O(e^{n+1}))$, then the iteration $\Phi$ is called linear (or first order) if $n = 1$ (and $|\alpha| < 1$). A linear iteration improves the result by (roughly) adding a constant amount of correct digits with every step.

A super-linear iteration does better than that: The number of correct digits grows exponentially (to the base $n$) at each step. Super-linear convergence of order $n$ should really be called exponential of order $n$. Iterations of second order ($n = 2$) are often called quadratic (or quadratically convergent), those of third order cubic iterations. Fourth, fifth and sixth order iterations are called quartic, quintic and sextic and so on. The two-point iteration relation 30.1-4 has order $(\sqrt{5} + 1)/2 \approx 1.618$, see [186, p.152].


It is conceivable to find iterations that converge better than linear but less than exponential to any base: imagine an iteration that produces proportional to $k^2$ digits at step $k$ (this is not quadratic convergence which produces proportional to $2^k$ correct digits at step $k$). That case is not covered by the 'order-n' notion just introduced. However, those (super-linear but sub-exponential) iterations are not usually encountered. In fact, the constructions used here cannot produce such an iteration. For a more fine-grained definition of the concept of order see [74, p.21].


For $n \geq 2$ the iteration function $\Phi$ has a super-attracting fixed point at $r$: $\Phi'(r) = 0$. For an iteration of order $n$ we have

$$\Phi'(r) = 0, \quad \Phi''(r) = 0, \quad \ldots, \quad \Phi^{(n-1)}(r) = 0 \tag{30.1-5}$$
There is no standard term for emphasizing the number of derivatives vanishing at the fixed point: super-attracting of order $n$ might be appropriate.

To any iteration of order $n$ for a function $f$ we can add a term $f(x)^n \cdot \varphi(x)$ (where $\varphi(x)$ is an arbitrary function that is analytic in a neighborhood of the root) without changing the order of convergence. Check the statement by verifying that the first $n - 1$ derivatives of $\Phi_n(x) + f(x)^n \cdot \varphi(x)$, evaluated at the root $r$, equal zero.

Any two one-point iterations of the same order $n$ differ by a term $f(x)^n \cdot \varphi(x)$.

Any two iterations of the same order $n$ differ by a term $(x - r)^n\, \nu(x)$ where $\nu(x)$ is a function that is finite at $r$ [186, p.174, ex.3].

Any one-point iteration of order $n$ must explicitly evaluate $f, f', \ldots, f^{(n-1)}$ [333, p.98]. For methods to find zeros and extrema without evaluating derivatives see [74].


30.2 Schroder’s formula 


For $n \geq 2$ the expression

$$S_n(x) := x + \sum_{k=1}^{n-1} (-1)^k\, \frac{f(x)^k}{k!} \left(\frac{1}{f'(x)}\, \partial\right)^{k-1} \frac{1}{f'(x)} \tag{30.2-1}$$


gives an n-th order iteration for a (simple) root r of f p.13]. That is, 


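$$\begin{aligned} S = x &- \frac{f}{1!\, f'} - \frac{f^2}{2!\, f'^3} \cdot f'' - \frac{f^3}{3!\, f'^5} \cdot \left(3 f''^2 - f' f'''\right) \\ &- \frac{f^4}{4!\, f'^7} \cdot \left(15 f''^3 - 10 f' f'' f''' + f'^2 f''''\right) \\ &- \frac{f^5}{5!\, f'^9} \cdot \left(105 f''^4 - 105 f' f''^2 f''' + 10 f'^2 f'''^2 + 15 f'^2 f'' f'''' - f'^3 f'''''\right) - \ldots \end{aligned} \tag{30.2-2}$$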


The second order iteration is the Newton iteration. A third order iteration (often referred to as Householder's method) is obtained by truncation after the third term on the right side:

$$S_3 = x - \frac{f}{f'} \left(1 + \frac{f f''}{2 f'^2}\right) \tag{30.2-3}$$
Approximating the second term on the right gives Halley's formula:
$$H_3 = x - \frac{f}{f'} \left(1 - \frac{f f''}{2 f'^2}\right)^{-1} \tag{30.2-4}$$
Write
$$S = x - U_1\, \frac{f}{1!\, f'^1} - U_2\, \frac{f^2}{2!\, f'^3} - U_3\, \frac{f^3}{3!\, f'^5} - \ldots - U_n\, \frac{f^n}{n!\, f'^{2n-1}} - \ldots \tag{30.2-5}$$

then $U_1 = 1$, $U_2 = f''$, $U_3 = 3 f''^2 - f' f'''$ and we have the recursion (see [304, p.16] or [186, p.148])
$$U_n = (2n - 3)\, f''\, U_{n-1} - f'\, U'_{n-1} \tag{30.2-6}$$

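An alternative recursion is given in [333, p.83], write

$$S = x - \left(\frac{f}{f'}\right) Y_1 - \left(\frac{f}{f'}\right)^2 Y_2 - \left(\frac{f}{f'}\right)^3 Y_3 - \ldots - \left(\frac{f}{f'}\right)^n Y_n - \ldots \tag{30.2-7}$$

then $Y_1 = 1$ and

$$Y_n = \frac{1}{n} \left(2 (n-1)\, \frac{f''}{2 f'}\, Y_{n-1} - Y'_{n-1}\right) \tag{30.2-8}$$

Relation 30.2-1 with $f(x) = 1/x^a - d$ gives the divisionless iteration 29.3-4 for arbitrary order. For $f(x) = \log(x) - d$ one finds the iteration 32.2-5. For $f(x) = x^2 - d$ we have

$$S(x) = x - \left(\frac{x^2 - d}{2 x} + \frac{(x^2 - d)^2}{8 x^3} + \frac{(x^2 - d)^3}{16 x^5} + \frac{5\, (x^2 - d)^4}{128 x^7} + \ldots\right) \tag{30.2-9a}$$
$$\quad = x - 2 x \cdot \left(Y + Y^2 + 2 Y^3 + 5 Y^4 + 14 Y^5 + 42 Y^6 + \ldots\right) \quad \text{where} \quad Y := \frac{x^2 - d}{(2 x)^2} \tag{30.2-9b}$$

The coefficients of the powers of $Y$ are the Catalan numbers, see section 15.4 on page 331.

30.2.1 Schröder's formula and series reversion ‡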


We give three ways to derive Schröder's iteration using (implicit or explicit) power series reversion. The reversion of the series

$$A(x) = \sum_{k=1}^{\infty} a_k x^k \tag{30.2-10}$$
is the series
$$B(x) = \sum_{k=1}^{\infty} b_k x^k \tag{30.2-11}$$

such that $A(B(x)) = x$. That is, $B(x) = A^{[-1]}(x)$ is the functional inverse of $A(x)$ (reversion is inversion with respect to composition). Note that $A^{[-1]}(x)$ is not the same as $A^{-1}(x) = 1/A(x)$. A useful relation

is given in [IT] p.634]: 
$$A^{[-1]}(x) = \sum_{k=1}^{\infty} \frac{x^k}{k!} \left[\left(\frac{d}{dx}\right)^{k-1} \left(\frac{x}{A(x)}\right)^k\right]_{x=0} \tag{30.2-12}$$

Equivalently, from [213, p.527],

$$k\, b_k = [k-1] \left(\frac{x}{A(x)}\right)^k \tag{30.2-13}$$

where $[k-1]\, Q$ denotes the $(k-1)$-st series coefficient of $Q$. We use the expression to give a few terms of the reversed series explicitly:

? n=5; R=O(x^(n+1));
? A=sum(k=1,n,x^k*eval(Str("a"k)))+R
a1*x + a2*x^2 + a3*x^3 + a4*x^4 + a5*x^5 + O(x^6)
? B=sum(k=1,n,x^k/k*(polcoeff(truncate((x/A)^k),k-1)))+R
  1/a1 * x
+ (-a2/a1^3) * x^2
+ ((-a3*a1 + 2*a2^2)/a1^5) * x^3
+ ((-a4*a1^2 + 5*a3*a2*a1 - 5*a2^3)/a1^7) * x^4
+ ((-a5*a1^3 + (6*a4*a2 + 3*a3^2)*a1^2 - 21*a3*a2^2*a1 + 14*a2^4)/a1^9) * x^5 + O(x^6)

The same result is computed by the built-in function serreverse(). Relation 30.2-13 can be generalized for the n-th power of the reversed series:

$$k\, [k]\, \left(B(x)\right)^n = n\, [k-n] \left(\frac{x}{A(x)}\right)^k \tag{30.2-14}$$

This is one way to state the Lagrange inversion formula.


30.2.1.1 Method of deriving the inverse function 


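The starting point is the power series of a function $f$ around $x_0$:
$$f(x) = \sum_{k=0}^{\infty} \frac{1}{k!}\, f^{(k)}(x_0)\, (x - x_0)^k \tag{30.2-15a}$$
$$\quad = f(x_0) + f'(x_0)\,(x - x_0) + \frac{1}{2} f''(x_0)\,(x - x_0)^2 + \frac{1}{6} f'''(x_0)\,(x - x_0)^3 + \ldots \tag{30.2-15b}$$

Now let $f(x_0) = y_0$ and $r$ be the zero of $f$ (that is, $f(r) = 0$). We expand the inverse function $g = f^{[-1]}$ around $y_0$:

$$g(0) = \sum_{k=0}^{\infty} \frac{1}{k!}\, g^{(k)}(y_0)\, (0 - y_0)^k \tag{30.2-16a}$$
$$\quad = g(y_0) + g'(y_0)\,(0 - y_0) + \frac{1}{2} g''(y_0)\,(0 - y_0)^2 + \frac{1}{6} g'''(y_0)\,(0 - y_0)^3 + \ldots \tag{30.2-16b}$$

Setting $x_0 = g(y_0)$ and $g(0) = r$ we find

$$r = x_0 - g'(y_0)\, f(x_0) + \frac{1}{2} g''(y_0)\, f(x_0)^2 - \frac{1}{6} g'''(y_0)\, f(x_0)^3 + \ldots \tag{30.2-17}$$

In order to express the derivatives of the inverse $g$ in terms of (derivatives of) $f$, set
$$f \circ g = \mathrm{id}, \quad \text{that is:} \quad f(g(x)) = x \tag{30.2-18}$$

and differentiate the equation (chain rule) to see that $g'(f(x))\, f'(x) = 1$, so $g'(y) = \frac{1}{f'(x)}$. Differentiate multiple times to obtain (arguments $y$ of $g$ and $x$ of $f$ are omitted for readability):

$$1 = f' g' \tag{30.2-19a}$$
$$0 = g' f'' + f'^2 g'' \tag{30.2-19b}$$
$$0 = g' f''' + 3 f' f'' g'' + f'^3 g''' \tag{30.2-19c}$$
$$0 = g' f'''' + 4 f' g'' f''' + 3 f''^2 g'' + 6 f'^2 f'' g''' + f'^4 g'''' \tag{30.2-19d}$$

This system of linear equations in the derivatives of $g$ can be solved successively for $g'$, $g''$, $g'''$, etc.:

$$g' = \frac{1}{f'} \tag{30.2-20a}$$
$$g'' = -\frac{f''}{f'^3} \tag{30.2-20b}$$
$$g''' = \frac{1}{f'^5} \left(3 f''^2 - f' f'''\right) \tag{30.2-20c}$$
$$g'''' = \frac{1}{f'^7} \left(10 f' f'' f''' - 15 f''^3 - f'^2 f''''\right) \tag{30.2-20d}$$
$$g''''' = \frac{1}{f'^9} \left(105 f''^4 - f'^3 f''''' - 105 f' f''^2 f''' + 15 f'^2 f'' f'''' + 10 f'^2 f'''^2\right) \tag{30.2-20e}$$

And so equation 30.2-17 can be written as

$$r = x_0 - \frac{1}{f'}\, f + \frac{1}{2} \left(-\frac{f''}{f'^3}\right) f^2 - \frac{1}{6} \left(\frac{1}{f'^5} \left(3 f''^2 - f' f'''\right)\right) f^3 + \ldots \tag{30.2-21a}$$
$$\quad = x_0 - \frac{f}{f'} - \frac{f^2}{2 f'^3}\, f'' - \frac{f^3}{6 f'^5} \left(3 f''^2 - f' f'''\right) - \ldots \tag{30.2-21b}$$

which is Schröder's iteration, equation 30.2-2 on page 588.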


30.2.1.2 Method of reversing power series 


Schröder's formula can be obtained as the reversion of the series
$$E(W) := -\sum_{k=1}^{\infty} \frac{f^{(k)}}{k!}\, (-W)^k = \left[1 - \exp(-W \partial)\right] \circ f \tag{30.2-22}$$

Let $L(W)$ be the reversion of $E(W)$, then $x - L(f)$ is Schröder's iteration:

? n=4; \\ up to order n
? x=W; \\ kludge for GP's variable ordering
? E=-sum(k=1,n,eval(Str("f"k))*(-x)^k/k!)+O(x^(n+1))
f1*W - 1/2*f2*W^2 + 1/6*f3*W^3 - 1/24*f4*W^4 + O(W^5)
? L=serreverse(E)
1/f1*W + f2/(2*f1^3)*W^2 + ((-f3*f1 + 3*f2^2)/(6*f1^5))*W^3 + \
  ((f4*f1^2 - 10*f3*f2*f1 + 15*f2^3)/(24*f1^7))*W^4 + O(W^5)
? L=truncate(L); \\ make it a polynomial
? for(j=1,n,print(-polcoeff(L,j,x)*(f0)^j))
-f0/f1
-f0^2*f2/(2*f1^3)
(f0^3*f3*f1 - 3*f0^3*f2^2)/(6*f1^5)
(-f0^4*f4*f1^2 + 10*f0^4*f3*f2*f1 - 15*f0^4*f2^3)/(24*f1^7)

Here we use 'fk' as a symbol for the k-th derivative of f.


30.2.1.3 Method of writing power series as operator functions 


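Write the power series of the function $f$ symbolically as

$$T(f) = \sum_{k=0}^{\infty} \frac{f^{(k)}}{k!}\, h^k = \left[\exp(+h\, \partial_x) \circ f\right] \tag{30.2-23}$$
In this notation Schröder's formula becomes
$$S(x) = \left[\exp(-h\, \partial_f) \circ x\right]_{h=f} \tag{30.2-24}$$

First expand as a series

$$S(x) = \left[\sum_{k=0}^{\infty} \frac{(-1)^k\, h^k}{k!}\, \partial_f^k\, x\right]_{h=f} = \sum_{k=0}^{\infty} \frac{(-1)^k\, f^k}{k!}\, \partial_f^k\, x \tag{30.2-25}$$

Now use
$$\partial_f = \frac{\partial}{\partial f} = \frac{\partial x}{\partial f}\, \frac{\partial}{\partial x} = \frac{1}{f'}\, \partial_x \tag{30.2-26}$$

and separate the term for $k = 0$ to find

$$S(x) = \sum_{k=0}^{\infty} \frac{(-1)^k\, f^k}{k!} \left(\frac{1}{f'}\, \partial_x\right)^k x = x + \sum_{k=1}^{\infty} \frac{(-1)^k\, f^k}{k!} \left(\frac{1}{f'}\, \partial_x\right)^{k-1} \frac{1}{f'} \tag{30.2-27}$$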
Truncation gives relation 30.2-1.

30.3 Householder's formula

The following expression gives, for $n \geq 2$, an n-th order iteration for a (simple) root $r$ of $f$ [186, p.169]:

$$H_n(x) := x + (n-1)\, \frac{\left(\frac{1}{f(x)}\right)^{(n-2)}}{\left(\frac{1}{f(x)}\right)^{(n-1)}} \tag{30.3-1}$$

We refer to iterations of this type as Householder iterations, the name König iteration function is used in [341]. We have

$$H_2 = x - \frac{f}{f'} \tag{30.3-2a}$$
$$H_3 = x - \frac{2 f f'}{2 f'^2 - f f''} \tag{30.3-2b}$$
$$H_4 = x - \frac{3 f \left(f f'' - 2 f'^2\right)}{6 f f' f'' - 6 f'^3 - f^2 f'''} \tag{30.3-2c}$$
$$H_5 = x + \frac{4 f \left(6 f'^3 - 6 f f' f'' + f^2 f'''\right)}{f^3 f'''' - 24 f'^4 + 36 f f'^2 f'' - 8 f^2 f' f''' - 6 f^2 f''^2} \tag{30.3-2d}$$

The second order variant is Newton's formula, the third order iteration is Halley's formula.


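Following [196], we give alternative forms of Householder's formula. Define the iteration $B_m$ for $m \geq 2$ as

$$B_m = x - f\, \frac{D_{m-2}}{D_{m-1}} \quad \text{where} \quad D_0 = 1, \quad D_1 = f', \quad \text{and} \tag{30.3-3a}$$

$$D_m = \det \begin{bmatrix} f' & \frac{f''}{2!} & \cdots & \frac{f^{(m-1)}}{(m-1)!} & \frac{f^{(m)}}{m!} \\ f & f' & \ddots & \ddots & \frac{f^{(m-1)}}{(m-1)!} \\ 0 & f & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \frac{f''}{2!} \\ 0 & 0 & \cdots & f & f' \end{bmatrix} \tag{30.3-3b}$$

The iteration is the same as Householder's ($B_m = H_m$). A recursive definition for $D_m$ is given by

$$D_m = \sum_{i=1}^{m} (-1)^{i-1}\, f^{i-1}\, \frac{f^{(i)}}{i!}\, D_{m-i} \tag{30.3-4}$$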


The derivation of Halley's formula by applying Newton's formula to $f/\sqrt{f'}$ can be generalized to produce m-order iterations as follows: Let $F_1 = f$ and for $m \geq 2$ let

$$F_m = \frac{F_{m-1}}{\sqrt[m]{F'_{m-1}}} \tag{30.3-5a}$$
$$H_m = x - \frac{F_{m-1}}{F'_{m-1}} \tag{30.3-5b}$$
An alternative recursive formulation is
$$Q_2 = 1 \tag{30.3-6a}$$
$$Q_m = f'\, Q_{m-1} - \frac{1}{m-2}\, f\, Q'_{m-1} \tag{30.3-6b}$$
$$H_m = x - f\, \frac{Q_m}{Q_{m+1}} \tag{30.3-6c}$$

The Taylor series of the k-th order Householder iteration around $f = 0$ up to order $k - 1$ gives the k-th order Schröder iteration.

An extraneous fixed point of an iteration for a function $f$ is a fixed point at $z$ such that $f(z) \neq 0$. All extraneous fixed points for the iterations $H_k$ are repelling ($|H_k(z)'| > 1$), see [197] and [341].


30.4 Dealing with multiple roots 


The iterations given so far will not converge at the stated order if $f$ has a multiple root at $r$. As an example consider the function

$$f(x) = (x^2 - d)^m \quad \text{where} \quad m \in \mathbb{N} \tag{30.4-1}$$
The iteration $\Phi(x) = x - f/f'$ is
$$\Phi(x) = x - \frac{x^2 - d}{2 m x} \tag{30.4-2}$$

Its convergence is only linear for $m > 1$: $\Phi(\sqrt{d}\,(1 + e)) = \sqrt{d} \left(1 + \frac{m-1}{m}\, e + O(e^2)\right)$.

A second order iteration for a root of known multiplicity $m$ is given in [186, p.161, ex.6]

$$\Phi_2(x) = x - m\, \frac{f}{f'} \tag{30.4-3}$$

Note that with the example above we obtain a quadratic iteration.


For roots of unknown multiplicity use the general expressions for iterations with $F := f/f'$ instead of $f$. Both $F$ and $f$ have the same set of roots, but all roots of $F$ are simple. To see this, consider a function $f$ that has a root of multiplicity $m$ at $r$: $f(x) := (x - r)^m\, h(x)$ with $h(r) \neq 0$. Then
$$f'(x) = m\, (x - r)^{m-1}\, h(x) + (x - r)^m\, h'(x) \tag{30.4-4a}$$
$$\quad = (x - r)^{m-1} \left(m\, h(x) + (x - r)\, h'(x)\right) \tag{30.4-4b}$$
and

$$F(x) = f(x)/f'(x) = (x - r)\, \frac{h(x)}{m\, h(x) + (x - r)\, h'(x)} \tag{30.4-5}$$

The fraction on the right side does not vanish at the root $r$.


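Substituting $F = f/f'$ into Householder's formula (relation 30.3-1) gives the following iterations denoted by $H_k^{\%}$, the iterations $H_k$ are given for comparison:

$$H_2 = x - \frac{f}{f'} \tag{30.4-6a}$$
$$H_2^{\%} = x - \frac{f f'}{f'^2 - f f''} \tag{30.4-6b}$$
$$H_3 = x - \frac{2 f f'}{2 f'^2 - f f''} \tag{30.4-6c}$$
$$H_3^{\%} = x + \frac{2 f^2 f'' - 2 f f'^2}{2 f'^3 - 3 f f' f'' + f^2 f'''} \tag{30.4-6d}$$
$$H_4 = x + \frac{3 f^2 f'' - 6 f f'^2}{6 f'^3 - 6 f f' f'' + f^2 f'''} \tag{30.4-6e}$$
$$H_4^{\%} = x + \frac{3 f^3 f''' - 9 f^2 f' f'' + 6 f f'^3}{f^3 f'''' - 6 f'^4 + 12 f f'^2 f'' - 4 f^2 f' f''' - 3 f^2 f''^2} \tag{30.4-6f}$$
$$H_5 = x + \frac{4 f^3 f''' - 24 f^2 f' f'' + 24 f f'^3}{f^3 f'''' - 24 f'^4 + 36 f f'^2 f'' - 8 f^2 f' f''' - 6 f^2 f''^2} \tag{30.4-6g}$$

The terms in the numerators and denominators of $H_k^{\%}$ and $H_{k+1}$ are identical up to the integral constants. The iteration $H_k^{\%}$ can also be written as

$$H_k^{\%} = x + (k-1)\, \frac{\left(\log(f)\right)^{(k-1)}}{\left(\log(f)\right)^{(k)}} \tag{30.4-7}$$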
Schröder's formula (relation 30.2-1), when inserting $f/f'$, becomes
/ 2 pr Pp 9 112 12 pn Ef! P(k 
e E RS. ay ss 
(a) a(ff" — f° SUIS i aa 
where P(k) contains derivatives up to fé, 
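We check the convergence with our example (relation 30.4-1), the second order iteration is
$$\Phi_2^{\%}(x) = x + x\, \frac{d - x^2}{d + x^2} = \frac{2 d x}{x^2 + d} \tag{30.4-9}$$
Convergence is indeed second order, as we have (compare to relation 29.2-7 on page 570)
$$\Phi_2^{\%}\left(\sqrt{d} \cdot \frac{1 - e}{1 + e}\right) = \sqrt{d} \cdot \frac{1 - e^2}{1 + e^2} \tag{30.4-10}$$
which holds independent of $m$. In general we have
$$\Phi_k^{\%}\left(\sqrt{d} \cdot \frac{1 - e}{1 + e}\right) = \sqrt{d} \cdot \frac{1 - e^k}{1 + e^k} \tag{30.4-11}$$
Schröder's third order formula for $f/f'$ with $f$ as in relation 30.4-1 gives a fourth order iteration for $\sqrt{d}$:
$$S_3^{\%}(x) = x + x\, \frac{d - x^2}{x^2 + d} + x\, d\, \frac{(d - x^2)^2}{(x^2 + d)^3} \tag{30.4-12a}$$
$$S_3^{\%}\left(\sqrt{d} \cdot \frac{1 - e}{1 + e}\right) = \sqrt{d}\; \frac{1 + 3 e^2 - 3 e^4 - e^6}{1 + 3 e^2 + 3 e^4 + e^6} \tag{30.4-12b}$$
$$\quad = \sqrt{d}\; \frac{1 - c}{1 + c} \quad \text{where} \quad c = e^4\, \frac{e^2 + 3}{3 e^2 + 1} \tag{30.4-12c}$$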


In general, the $(1 + a k)$-th order Schröder iteration for $1/\sqrt[a]{d}$ with $f/f'$ has an order of convergence that exceeds the expected order by one. The third order Schröder iteration for $f(x) = 1 - d x^2$ is

$$S_3^{\%}(x) = x + x\, \frac{1 - d x^2}{1 + d x^2} + x\, \frac{(1 - d x^2)^2}{(1 + d x^2)^3} \tag{30.4-13}$$

The iteration also has fourth order convergence and the error expression $S_3^{\%}\left(\frac{1}{\sqrt{d}} \cdot \frac{1 - e}{1 + e}\right)$ is obtained by replacing $\sqrt{d}$ with $1/\sqrt{d}$ in relation 30.4-12b.


30.5 More iterations 


We give expressions for iterations via Padé approximants, radicals, and show how iterations can be obtained from given ones. Finally we give one form of a multi-point iteration.

exp(x) == 1/24*x^4 + 1/6*x^3 + 1/2*x^2 + x + 1

P[4,0] = (1/24*x^4 + 1/6*x^3 + 1/2*x^2 + x + 1) / (1)
P[3,1] = (4*x^3 + 24*x^2 + 72*x + 96) / (-24*x + 96)
P[2,2] = (1/4*x^2 + 3/2*x + 3) / (1/4*x^2 - 3/2*x + 3)
P[1,3] = (24*x + 96) / (-4*x^3 + 24*x^2 - 72*x + 96)
P[0,4] = (1) / (1/24*x^4 - 1/6*x^3 + 1/2*x^2 - x + 1)

Figure 30.5-A: Padé approximants $P_{[i,j]}$ where $i + j = 4$ for the exponential function.


30.5.1 Padé approximants 


Let $S = \sum_{k=0}^{\infty} a_k x^k$ be a power series. A Padé approximant $P_{[i,j]}$ of $S$ is a ratio $A/B$ of polynomials $A$ and $B$ with $\deg(A) = i$ and $\deg(B) = j$ such that $A/B = S + O(x^{n+1})$ where $n = i + j$. Figure 30.5-A shows the approximants $P_{[i,j]}$ where $i + j = 4$ for the exponential function.


Let $S_n$ be a polynomial that coincides with a power series $S$ up to (and including) the n-th power and assume that the constant term $a_0$ is nonzero. Then the approximant $P_{[n-d,d]}$ can be computed with the extended GCD algorithm. For convenience we rewrite the EGCD implementation in GP:


egcd(u, v)=
\\ Same as the built-in bezout(u, v)
{
    local(t, q);
    u = [1, 0, u];
    v = [0, 1, v];
    t = [0, 0, 0];
    while ( v[3]!=0,
        q = (u[3] \ v[3]);  \\ division without remainder
        t = u - v*q;  u = v;  v = t;
    );
    return( [u[1], u[2], u[3]] );
}


Now, following [213, ex.13, p.534], we can compute $P_{[n-d,d]}$ with the EGCD routine with arguments $S$ (up to the n-th term) and $x^{n+1}$, terminated after $d$ steps:

pade(s, d)=
/*
  Compute Pade approximant A/B (of the power series S)
  such that deg(A)=deg(S)-d.
  Must have: d < deg(S);
  S must have a nonzero constant term.
*/
{
    local(n, u, v, t, q);
    s = truncate(s);  \\ remove O(x^(n+1)) term if present
    n = poldegree(s);
    u = [1, 0, x^(n+1)];
    v = [0, 1, s];
    t = [0, 0, 0];
    \\ PRINT v[3] / v[2]  ( ==s/ 1)
    for ( j=1, d,
        q = (u[3] \ v[3]);  \\ division without remainder
        t = u - v*q;  u = v;  v = t;
        \\ PRINT v[3] / v[2]
    );
    return( v[3]/v[2] );
}


If the first nonzero coefficient of $S$ is $a_k$, then $P_{[i,j]} = x^k\,Q_{[i-k,j]}$ where $Q$ is the approximant for $S/x^k$.


30.5.2 Rational iterations from Padé approximants 


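The $[i,j]$-th Padé approximant of $\Phi_n$ in $f$ gives an iteration of order $p = i+j+1$ (if $n \ge p$). Write $\Phi_{[i,j]}$
for an iteration (of order $i+j+1$) that is obtained using the approximant $[i,j]$. For the second order
(where the Newton iteration is $\Phi_{[1,0]}(x) = x - f/f'$) this method gives one alternative form, namely

\[ \Phi_{[0,1]}(x) = \frac{x^2 f'}{x f' + f} = x - \frac{x f}{f + x f'} = x \left(1 + \frac{f}{x f'}\right)^{-1} \tag{30.5-1} \]

For the third order we find $\Phi_{[2,0]}(x) = S_3(x)$, $\Phi_{[1,1]}(x) = H_3(x)$, and

\[ \Phi_{[0,2]}(x) = \frac{2 x^3 f'^3}{2 f^2 f' + 2 x f f'^2 + x f^2 f'' + 2 x^2 f'^3} \tag{30.5-2a} \]
\[ = x - \frac{x f \left(2 f f' + x f f'' + 2 x f'^2\right)}{2 f^2 f' + 2 x f f'^2 + x f^2 f'' + 2 x^2 f'^3} \tag{30.5-2b} \]
\[ = x \left(1 + \frac{f}{x f'} + \frac{f^2}{x^2 f'^2} + \frac{f^2 f''}{2 x f'^3}\right)^{-1} \tag{30.5-2c} \]
\[ = x \left(1 + (x f')^{-1} f + \left[(x f')^{-2} + \frac{f''}{2 x f'^3}\right] f^2\right)^{-1} \tag{30.5-2d} \]


Alternatively we can use the Padé approximant $A_{[i,j]}$ of $(\Phi(x) - x)/f$ in $f$ where $\Phi(x)$ is a given iteration
of order $\ge i + j + 2$. Then $\Phi^{+}_{[i,j]} := x + f\,A_{[i,j]}$ is an iteration which has order $n = i + j + 2$.

\[ \Phi^{+}_{[0,1]}(x) = x - \frac{2 f f'}{2 f'^2 - f f''} = H_3(x) \tag{30.5-3a} \]
\[ \Phi^{+}_{[1,0]}(x) = x - \frac{f \left(2 f'^2 + f f''\right)}{2 f'^3} = S_3(x) \tag{30.5-3b} \]

The iterations $\Phi^{+}$ of order $n$ are expressions in $x, f, f', \ldots, f^{(n-1)}$. Fourth order iterations are

\[ \Phi^{+}_{[2,0]}(x) = x - \frac{f \left(6 f'^4 + 3 f f'^2 f'' - f^2 f' f''' + 3 f^2 f''^2\right)}{6 f'^5} \tag{30.5-4a} \]
\[ = x - \frac{f}{f'} - \frac{f^2 f''}{2 f'^3} - \frac{f^3}{6 f'^5} \left(3 f''^2 - f' f'''\right) = S_4(x) \tag{30.5-4b} \]
\[ = x - \frac{f}{f'} - \frac{f^2 f''}{2 f'^3} - \frac{f^3 f''^2}{2 f'^5} + \frac{f^3 f'''}{6 f'^4} \tag{30.5-4c} \]
\[ \Phi^{+}_{[1,1]}(x) = x - \frac{f \left(6 f'^2 f'' - 3 f f''^2 + 2 f f' f'''\right)}{f' \left(2 f f' f''' - 6 f f''^2 + 6 f'^2 f''\right)} \tag{30.5-4d} \]
\[ \Phi^{+}_{[0,2]}(x) = x - \frac{12 f f'^3}{12 f'^4 - 6 f f'^2 f'' + 2 f^2 f' f''' - 3 f^2 f''^2} \tag{30.5-4e} \]
\[ = x - \frac{f}{f'} \left(1 - \frac{f f''}{2 f'^2} + \frac{f^2 \left(2 f' f''' - 3 f''^2\right)}{12 f'^4}\right)^{-1} \tag{30.5-4f} \]

The iteration $\Phi^{+}_{[n-2,0]}$ always coincides with Schröder's iteration. In general one finds $n - 2$ additional forms
of iterations using the approximants $[0,n-2]$, $[1,n-3]$, ..., $[n-3,1]$.
Neglecting terms that contain the third derivative in relation 30.5-4d we find the third order iteration

\[ \Phi_3 = x - \frac{f \left(2 f'^2 - f f''\right)}{f' \left(2 f'^2 - 2 f f''\right)} \tag{30.5-5} \]

A closed form for the Padé approximants of the $r$-th root is given in relation 29.2-13 on page 572.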
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30.5.3 An iteration involving radicals 


By directly solving the truncated Taylor expansion

\[ f(r) = f(x) + f'(x)\,(r - x) + \frac{1}{2} f''(x)\,(r - x)^2 \tag{30.5-6} \]

of $f(r) = 0$ around $x$ we find the following third order iteration:

\[ \Phi_3 = x - \frac{1}{f''} \left(f' \pm \sqrt{f'^2 - 2 f f''}\right) \tag{30.5-7} \]

For $f(x) = a x^2 + b x + c$ this gives the two solutions of the quadratic equation $f(x) = 0$; for other
functions we find iterated square root expressions for the roots.


The following form, given in [333, p.94], avoids possible cancellation:

\[ \Phi_3 = x - \frac{2u}{1 + \sqrt{1 - 4 A u}} \quad\text{where}\quad u := f/f' \quad\text{and}\quad A := f''/(2 f') \tag{30.5-8} \]

It can be found by observing that

\[ \frac{-b + \sqrt{b^2 - 4ac}}{2a} = \frac{-2c}{b + \sqrt{b^2 - 4ac}} \tag{30.5-9} \]
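
The cancellation-free form 30.5-8 can be sketched in C++ as follows. This is only an illustration in double precision: the function names `radical_step` and `sqrt_by_radical_step` are made up for this example, and the caller is assumed to supply the values of $f$, $f'$ and $f''$ at $x$.

```cpp
#include <cassert>
#include <cmath>

// One step of the third order iteration of relation 30.5-8:
//   x - 2u / (1 + sqrt(1 - 4 A u)),  u = f/f',  A = f''/(2 f').
// f0, f1, f2 are the values f(x), f'(x), f''(x) (hypothetical interface).
double radical_step(double x, double f0, double f1, double f2)
{
    assert(f1 != 0.0);
    const double u = f0 / f1;
    const double A = f2 / (2.0 * f1);
    const double disc = 1.0 - 4.0 * A * u;
    assert(disc >= 0.0);  // this sketch handles real square roots only
    return x - 2.0 * u / (1.0 + std::sqrt(disc));
}

// Example: f(x) = x^2 - d.  Because f is quadratic, one step
// already gives a root (up to rounding).
double sqrt_by_radical_step(double d, double x)
{
    return radical_step(x, x * x - d, 2.0 * x, 2.0);
}
```

For a quadratic $f$ the truncated Taylor expansion is exact, so a single step from any admissible start lands on the root nearest to it; for other functions the step has to be iterated.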


30.5.4 Iterations from iterations 


Alternative rational forms can also be obtained in a way that generalizes the method used for multiple
roots. We emphasize the so far notationally omitted dependency on the function $f$ as $\Phi\{f\}$. The iteration
$\Phi\{f\}$ has fixed points where $f$ has a root $r$, so $x - \Phi\{f\}$ again has a root at $r$. Hence we can build more
iterations that will converge to those roots as $\Phi\{x - \Phi\{f\}\}$. For dealing with multiple roots we used
$\Phi\{x - \Phi_2\{f\}\} = \Phi\{f/f'\}$. An iteration $\Phi_k\{x - \Phi_j\{f\}\}$ can only be expected to have a $k$-th order
convergence.


30.5.5 A multi-point iteration 


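A multi-point iteration of order $r$ can be given [333, p.165] as

\[ \Phi_r(x) = \Phi_{r-1}(x) - \frac{f(\Phi_{r-1}(x))}{f'(x)} \tag{30.5-10} \]

where $\Phi_{r-1}(x)$ is an iteration of order $r - 1$. For example, choose $\Phi_2(x) = x - f(x)/f'(x)$ to find the
third order iteration

\[ \Phi_3(x) = \Phi_2(x) - \frac{f(\Phi_2(x))}{f'(x)} \tag{30.5-11} \]

Apply the method again to find

\[ \Phi_4(x) = \Phi_3(x) - \frac{f(\Phi_3(x))}{f'(x)} = \Phi_2(x) - \frac{f(\Phi_2(x))}{f'(x)} - \frac{f(\Phi_3(x))}{f'(x)} \tag{30.5-12} \]

The $r$-th order iteration is (with $\Phi_1(x) := x$)

\[ \Phi_r(x) = x - \sum_{k=2}^{r} \frac{f(\Phi_{k-1}(x))}{f'(x)} \tag{30.5-13a} \]
\[ = x - \frac{1}{f'(x)} \left[ f(x) + \sum_{k=2}^{r-1} f(\Phi_k(x)) \right] \tag{30.5-13b} \]


The function $f$ is evaluated at $r - 1$ points but the derivative is only evaluated at $x$. The iteration also
involves only one inversion.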
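
A minimal C++ sketch of one step of the multi-point iteration is given below. It is not taken from the accompanying libraries; the name `multipoint_step` and the use of `std::function` are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// One step of the multi-point iteration of order r (r >= 2):
// f is evaluated at r-1 points, f' only at x, and there is a
// single inversion.  (Hypothetical helper, double precision only.)
double multipoint_step(const std::function<double(double)> &f,
                       const std::function<double(double)> &df,
                       double x, int r)
{
    assert(r >= 2);
    const double inv = 1.0 / df(x);   // the only inversion
    double y = x;                     // y runs through Phi_1 = x, Phi_2, ...
    for (int k = 1; k < r; ++k)  y -= f(y) * inv;
    return y;                         // Phi_r(x)
}
```

With `r = 2` the step is the Newton step; each further evaluation of `f` raises the order by one without a new derivative evaluation.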




30.6 Convergence improvement by the delta squared process 


Given a sequence of partial sums $x_k$ the delta squared process computes a new sequence $x^*_k$ of extrapolated
sums:

\[ x^*_k = x_{k+2} - \frac{(x_{k+2} - x_{k+1})^2}{x_{k+2} - 2 x_{k+1} + x_k} \tag{30.6-1} \]

The method is due to Aitken. The name 'delta squared process' is due to the alternative form

\[ x^* = x - \frac{(\Delta x)^2}{\Delta^2 x} \tag{30.6-2} \]

where $\Delta$ is the difference operator. Note that the algebraically equivalent form

\[ x^*_k = \frac{x_k\, x_{k+2} - x_{k+1}^2}{x_{k+2} - 2 x_{k+1} + x_k} \tag{30.6-3} \]

should be avoided in numerical computations due to possible cancellation.


If $x_k = \sum_{i=0}^{k} a_i$ and the ratio of consecutive summands $a_i$ is approximately constant (that is, $a$ is close
to a geometric series), then $x^*$ converges significantly faster to $x_\infty$ than $x$. Rewrite relation 30.6-1 with
$a_k := x_k - x_{k-1}$:

\[ x^*_k = x_{k+2} - \frac{(a_{k+2})^2}{a_{k+2} - a_{k+1}} \tag{30.6-4} \]

For a geometric series (where $a_{k+1}/a_k = q$) we have

\[ x^*_k = x_{k+2} - \frac{(a_{k+2})^2}{a_{k+2} - a_{k+1}} = x_{k+2} - \frac{(a_0\, q^{k+2})^2}{a_0\, (q^{k+2} - q^{k+1})} \tag{30.6-5a} \]
\[ = a_0\, \frac{1 - q^{k+3}}{1 - q} + a_0\, \frac{q^{k+3}}{1 - q} = \frac{a_0}{1 - q} \tag{30.6-5b} \]

which is the exact sum. Now consider the sequence of successively better approximations to some root $r$
of a function $f$:

\[ x_0, \quad x_1 = \Phi(x_0), \quad x_2 = \Phi(x_1) = \Phi(\Phi(x_0)), \quad \ldots, \quad x_k = \Phi^{(k)}(x_0), \quad \ldots \tag{30.6-6} \]

Think of the $x_k$ as partial sums of a series whose sum is the root $r$. Apply the idea to define an improved
iteration $\Phi^*$ from a given one $\Phi$:

\[ \Phi^*(x) = \Phi(\Phi(x)) - \frac{\left[\Phi(\Phi(x)) - \Phi(x)\right]^2}{\Phi(\Phi(x)) - 2\,\Phi(x) + x} \tag{30.6-7} \]

The good news is that $\Phi^*$ will give quadratic convergence even if $\Phi$ only has linear convergence. For
example, take $f(x) = (x^2 - d)^2$, forget that its root $\sqrt{d}$ is a double root, and happily define $\Phi(x) =
x - f(x)/f'(x) = x - (x^2 - d)/(4x)$. Convergence is only linear:

\[ \Phi\left(\sqrt{d}\,(1 + e)\right) = \sqrt{d}\, \left(1 + \frac{e}{2} + \frac{e^2}{4} - \frac{e^3}{4} + O(e^4)\right) \tag{30.6-8} \]

Then try

\[ \Phi^*(x) = \frac{d\,(7 x^2 + d)}{x\,(3 x^2 + 5 d)} \tag{30.6-9} \]

and find that it has quadratic convergence

\[ \Phi^*\left(\sqrt{d}\,(1 + e)\right) = \sqrt{d}\, \left(1 - \frac{e^2}{4} + \frac{e^3}{16} + O(e^4)\right) \tag{30.6-10} \]


In general, if $\Phi_n$ has convergence of order $n > 1$, then $\Phi^*_n$ will be of order $2n - 1$, but linear convergence
($n = 1$) is turned into second order, see p.165].
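
The delta squared process and the improved iteration $\Phi^*$ of relation 30.6-7 can be sketched in C++ as follows. The names `delta_squared` and `phi_star` are invented for this example, and the code works in double precision only.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Delta squared extrapolation (relation 30.6-1) from three
// consecutive values x0 = x_k, x1 = x_{k+1}, x2 = x_{k+2}.
double delta_squared(double x0, double x1, double x2)
{
    const double d1 = x2 - x1;
    const double d2 = x2 - 2.0 * x1 + x0;
    if (d2 == 0.0)  return x2;   // sequence is already stationary
    return x2 - d1 * d1 / d2;
}

// Improved iteration Phi^* obtained from a given iteration phi
// (relation 30.6-7); phi is evaluated twice per step.
double phi_star(const std::function<double(double)> &phi, double x)
{
    const double y = phi(x);
    const double z = phi(y);
    return delta_squared(x, y, z);
}
```

Applied to the partial sums 1, 3/2, 7/4 of the geometric series with $q = 1/2$ the extrapolation returns the exact sum 2, and for $\Phi(x) = x - (x^2 - d)/(4x)$ the function `phi_star` reproduces the closed form 30.6-9.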
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Chapter 31 


The AGM, elliptic integrals, and
algorithms for computing π


The arithmetic-geometric mean (AGM) is the basis for fast algorithms for the computation of π to high
precision. We give several relations between the elliptic integrals that are special cases of hypergeometric
transformations and AGM-based algorithms for the computation of certain hypergeometric functions.
AGM-based algorithms for the computation of the logarithm are given in section 32.1.1 on page 622 and
for the exponential function in section 32.2.1 on page 627.


31.1 The arithmetic-geometric mean (AGM) 


The arithmetic-geometric mean (AGM) plays a central role in the high precision computation of logarithms
and π. The AGM$(a, b)$ is defined as the limit of the iteration

\[ a_{k+1} = \frac{a_k + b_k}{2} \tag{31.1-1a} \]
\[ b_{k+1} = \sqrt{a_k\, b_k} \tag{31.1-1b} \]

starting with $a_0 = a$ and $b_0 = b$. Both of the values converge quadratically to a common limit. The
related quantity $c_k$ used in many AGM-based computations is defined as

\[ c_k^2 = a_k^2 - b_k^2 = (a_{k-1} - a_k)^2 \tag{31.1-2} \]

We also have $c_{k+1} = c_k^2 / (4\, a_{k+1})$ which is numerically stable but involves a long division. The quantity

\[ R' := 1 - \sum_{n=0}^{\infty} 2^{n-1}\, c_n^2 \tag{31.1-3} \]

will be used in the computation of the elliptic integral of the second kind.


Another way for computing the AGM is the iteration

\[ a_{k+1} = \frac{a_k + b_k}{2} \tag{31.1-4a} \]
\[ c_{k+1} = \frac{a_k - b_k}{2} \tag{31.1-4b} \]
\[ b_{k+1} = \sqrt{a_{k+1}^2 - c_{k+1}^2} = \sqrt{\left[a_{k+1} + c_{k+1}\right] \left[a_{k+1} - c_{k+1}\right]} \tag{31.1-4c} \]
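
As an illustration, the iteration 31.1-4a to 31.1-4c together with the sum in relation 31.1-3 can be written in C++ as below. This is a double precision sketch, not the library code; the struct `AgmResult` and the function `agm_with_sum` are names chosen for the example.

```cpp
#include <cassert>
#include <cmath>

struct AgmResult { double agm; double r_prime; };

// AGM(a, b) for a >= b > 0 via relations 31.1-4a..31.1-4c, and
//   R' = 1 - sum_{n>=0} 2^(n-1) c_n^2   with  c_0^2 = a^2 - b^2.
AgmResult agm_with_sum(double a, double b)
{
    assert(a > 0.0 && b > 0.0 && a >= b);
    double pow2 = 0.5;                       // 2^(n-1) for n = 0
    double sum = pow2 * (a * a - b * b);     // term for n = 0
    for (int n = 1; n <= 60; ++n)
    {
        const double c = 0.5 * (a - b);      // c_n
        const double an = 0.5 * (a + b);     // a_n
        b = std::sqrt((an + c) * (an - c));  // b_n, relation 31.1-4c
        a = an;
        pow2 *= 2.0;
        sum += pow2 * c * c;
        if (c <= 1e-17 * a)  break;          // converged in double precision
    }
    return { a, 1.0 - sum };
}
```

For $a = 1$, $b = 1/\sqrt{2}$ a handful of steps suffice in double precision, reflecting the quadratic convergence.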


31.1.1 Schönhage's variant


Schönhage gives the most economic variant of the AGM, which, apart from the square root, only needs
one squaring per step: initialize

\[ A_0 = a_0^2, \qquad B_0 = b_0^2, \qquad t_0 = 1 - (A_0 - B_0) \tag{31.1-5a} \]
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and iterate 


(31.1-5b) 
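\[ S_k = \frac{A_k + B_k}{4} \tag{31.1-5c} \]
\[ b_k = \sqrt{B_k} \quad \text{[square root]} \tag{31.1-5d} \]
\[ a_{k+1} = \frac{a_k + b_k}{2} \tag{31.1-5e} \]
\[ A_{k+1} = a_{k+1}^2 \quad \text{[squaring]} \tag{31.1-5f} \]
\[ \phantom{A_{k+1}} = \left(\frac{\sqrt{A_k} + \sqrt{B_k}}{2}\right)^2 = \frac{A_k + B_k}{4} + \frac{\sqrt{A_k B_k}}{2} \tag{31.1-5g} \]
\[ B_{k+1} = 2\,(A_{k+1} - S_k) = b_{k+1}^2 \tag{31.1-5h} \]
\[ c_{k+1}^2 = A_{k+1} - B_{k+1} = a_{k+1}^2 - b_{k+1}^2 \tag{31.1-5i} \]
\[ t_{k+1} = t_k - 2^{k+1}\, c_{k+1}^2 \tag{31.1-5j} \]

Starting with $a_0 = A_0 = 1$, $B_0 = 1/2$ one has $\pi \approx (2\, a_n^2)/t_n$.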
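
The following C++ fragment is a double precision sketch of this variant (the function name `pi_schoenhage` is hypothetical; the hfloat implementation referred to in the text works with multi-precision numbers and is not reproduced here).

```cpp
#include <cassert>
#include <cmath>

// pi via Schoenhage's AGM variant, started with a_0 = A_0 = 1, B_0 = 1/2.
// Per step: one square root, one squaring, otherwise only additions
// and multiplications by powers of two.
double pi_schoenhage(int steps)
{
    assert(steps >= 1);
    double a = 1.0, A = 1.0, B = 0.5;
    double t = 1.0 - (A - B);             // t_0
    double pow2 = 1.0;                    // 2^k
    for (int k = 0; k < steps; ++k)
    {
        const double S = 0.25 * (A + B);  // S_k
        const double b = std::sqrt(B);    // b_k         [square root]
        a = 0.5 * (a + b);                // a_{k+1}
        A = a * a;                        // A_{k+1}     [squaring]
        B = 2.0 * (A - S);                // B_{k+1} = b_{k+1}^2
        pow2 *= 2.0;                      // 2^(k+1)
        t -= pow2 * (A - B);              // t_{k+1}
    }
    return 2.0 * A / t;                   // 2 a_n^2 / t_n
}
```

In double precision four or five steps already exhaust the available accuracy.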
31.1.2 Fourth order iteration for the AGM 
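Combining two steps of the AGM iteration leads to the fourth order AGM iteration:

\[ \alpha_0 = \sqrt{a_0}, \qquad \beta_0 = \sqrt{b_0} \tag{31.1-6a} \]
\[ \alpha_{k+1} = \frac{\alpha_k + \beta_k}{2} \tag{31.1-6b} \]
\[ \beta_{k+1} = \left(\frac{\alpha_k\, \beta_k\, (\alpha_k^2 + \beta_k^2)}{2}\right)^{1/4} \tag{31.1-6c} \]
\[ \gamma_k^4 = \alpha_k^4 - \beta_k^4 = c_{2k}^2 \tag{31.1-6d} \]


We have $\alpha_k = \sqrt{a_{2k}}$, $\beta_k = \sqrt{b_{2k}}$, and $\gamma_k = \sqrt{c_{2k}}$. Writing AGM4$(a,b)$ for the common mean we have
AGM$(a, b) = \left[\mathrm{AGM4}\left(\sqrt{a}, \sqrt{b}\right)\right]^2$. Another form of the iteration is:

\[ \alpha_{k+1} = \frac{\alpha_k + \beta_k}{2} \tag{31.1-7a} \]
\[ \gamma_{k+1} = \frac{\alpha_k - \beta_k}{2} \tag{31.1-7b} \]
\[ \beta_{k+1} = \left(\alpha_{k+1}^4 - \gamma_{k+1}^4\right)^{1/4} = \left(\left[\alpha_{k+1}^2 + \gamma_{k+1}^2\right] \left[\alpha_{k+1}^2 - \gamma_{k+1}^2\right]\right)^{1/4} \tag{31.1-7c} \]
\[ \alpha_{k+1}^2 + \gamma_{k+1}^2 = \alpha_k^2 - \frac{1}{2}\left(\alpha_k^2 - \beta_k^2\right) \tag{31.1-7d} \]

The second identity for $\beta_{k+1}$ replaces the computation of two fourth powers by two squarings and a
multiplication. Compute $R'$ via

\[ R'(k) = 1 - \sum_{n=0}^{\infty} 4^n \left(\alpha_n^4 - \left(\frac{\alpha_n^2 + \beta_n^2}{2}\right)^2\right) \tag{31.1-8} \]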


31.2 The elliptic integrals K and E 


The elliptic integrals $K(k)$ and $E(k)$ can be computed via the AGM which gives super-linear convergence.
The logarithmic singularity of $K(k)$ at the point $k = 1$ is the key to the fast computation of the
logarithm. The exponential function could be computed by inverting the logarithm but also as described
in section 32.2.1 on page 627. For computations with very high precision the algorithms based on the
elliptic integrals are among the fastest known today for the logarithm, the number π, and the
exponential function.


31.2.1 Elliptic K 
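The complete elliptic integral of the first kind can be defined as

\[ K(k) = \int_0^{\pi/2} \frac{d\vartheta}{\sqrt{1 - k^2 \sin^2\vartheta}} = \int_0^1 \frac{dt}{\sqrt{(1 - t^2)\,(1 - k^2 t^2)}} \tag{31.2-1} \]

Special values are $K(0) = \pi/2$ and $\lim_{k \to 1^-} K(k) = +\infty$, and we have

\[ K(k) = \frac{\pi}{2}\, F\!\left(\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, k^2\right) \tag{31.2-2a} \]
\[ = \frac{\pi}{2} \sum_{i=0}^{\infty} \left(\frac{(2i-1)!!}{2^i\, i!}\right)^2 k^{2i} = \frac{\pi}{2} \sum_{i=0}^{\infty} \left(\frac{\binom{2i}{i}}{4^i}\right)^2 k^{2i} \tag{31.2-2b} \]
\[ = \frac{\pi}{2} \left(1 + \left(\frac{1}{2}\right)^2 k^2 + \left(\frac{1 \cdot 3}{2 \cdot 4}\right)^2 k^4 + \left(\frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6}\right)^2 k^6 + \ldots\right) \tag{31.2-2c} \]
\[ = \frac{\pi}{2} \left(1 + \frac{1}{4} k^2 + \frac{9}{64} k^4 + \frac{25}{256} k^6 + \frac{1225}{16384} k^8 + \frac{3969}{65536} k^{10} + \ldots\right) \tag{31.2-2d} \]


31.2.1.1 Computation via the AGM

The connection to the AGM is

\[ F\!\left(\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, z\right) = \frac{1}{\mathrm{AGM}(1, \sqrt{1 - z})} \tag{31.2-3} \]
\[ F\!\left(\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, 1 - \frac{b^2}{a^2}\right) = \frac{a}{\mathrm{AGM}(a, b)} = \frac{1}{\mathrm{AGM}(1, b/a)} \tag{31.2-4} \]

or, in terms of $K(k)$ as

\[ K(k) = \frac{\pi}{2\, \mathrm{AGM}(1, k')} = \frac{\pi}{2\, \mathrm{AGM}(1, \sqrt{1 - k^2})} \tag{31.2-5a} \]

A C++ implementation of the AGM-based computation is given in [hfloat: src/tz/elliptic-k.cc]. We
define $k' = \sqrt{1 - k^2}$ and $K'(k)$ as $K(k')$:

\[ K'(k) := K(k') = K\!\left(\sqrt{1 - k^2}\right) = \frac{\pi}{2\, \mathrm{AGM}(1, k)} \tag{31.2-5b} \]

For $k$ close to 1 we have

\[ K(k) \approx \log\frac{4}{\sqrt{1 - k^2}} = \log\frac{4}{k'} \tag{31.2-6} \]

The following estimate is given in [66, p.11]:

\[ \left| K'(k) - \log\frac{4}{k} \right| \le 4 k^2\, (8 + |\log k|) \quad \text{where} \quad 0 < k \le 1 \tag{31.2-7} \]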
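
A minimal double precision sketch of the AGM-based computation of $K$ (relation 31.2-5a) is given below. It is not the hfloat implementation; the helper names `agm` and `elliptic_k` are chosen for this example only.

```cpp
#include <cassert>
#include <cmath>

// Arithmetic-geometric mean of two positive numbers.
double agm(double a, double b)
{
    assert(a > 0.0 && b > 0.0);
    // The iteration count is capped; a few steps suffice for doubles.
    for (int n = 0; n < 60 && std::fabs(a - b) > 4e-16 * a; ++n)
    {
        const double t = 0.5 * (a + b);
        b = std::sqrt(a * b);
        a = t;
    }
    return 0.5 * (a + b);
}

// K(k) = pi / (2 AGM(1, sqrt(1 - k^2)))  for 0 <= k < 1.
double elliptic_k(double k)
{
    assert(k >= 0.0 && k < 1.0);
    const double pi = std::acos(-1.0);
    const double kp = std::sqrt((1.0 - k) * (1.0 + k));  // k'
    return pi / (2.0 * agm(1.0, kp));
}
```

The complementary modulus is computed as $\sqrt{(1-k)(1+k)}$ rather than $\sqrt{1 - k^2}$, which is slightly more accurate for $k$ near 1.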
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31.2.1.2 Product forms 


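Product forms for $K$ and $K'$ that are also candidates for fast computations are, for $0 < k_0 < 1$,

\[ \frac{2}{\pi}\, K'(k_0) = \prod_{n=0}^{\infty} \frac{2}{1 + k_n} \quad \text{where} \quad k_{n+1} := \frac{2 \sqrt{k_n}}{1 + k_n}, \quad k_\infty = 1 \tag{31.2-8a} \]
\[ \frac{2}{\pi}\, K'(k_0) = \prod_{n=0}^{\infty} \frac{1}{\sqrt{k_n}} \quad \text{where} \quad k_{n+1} := \frac{1 + k_n}{2 \sqrt{k_n}}, \quad k_\infty = 1 \tag{31.2-8b} \]


The second form is computationally especially attractive since, apart from the multiplication with the
main product, only an inverse square root needs to be computed per step. The product formulas follow
directly from relation 31.2-5b (and AGM$(a, b) = a\, \mathrm{AGM}(1, b/a) = b\, \mathrm{AGM}(a/b, 1)$):

\[ \frac{1}{\mathrm{AGM}(1, k)} = \left[\mathrm{AGM}\!\left(\frac{1 + k}{2}, \sqrt{k}\right)\right]^{-1} \tag{31.2-9a} \]
\[ = \left[\frac{1 + k}{2}\; \mathrm{AGM}\!\left(1, \frac{2 \sqrt{k}}{1 + k}\right)\right]^{-1} \quad \text{(first form)} \tag{31.2-9b} \]
\[ = \left[\sqrt{k}\; \mathrm{AGM}\!\left(\frac{1 + k}{2 \sqrt{k}}, 1\right)\right]^{-1} \quad \text{(second form)} \tag{31.2-9c} \]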
Similarly, for 0 < ko < 1, 
2 a 2 r3 1- Kk, 
-K(k) = IL - II (Ls) where Enpi = pyg Boe 0 — (312-103) 
2 IT i 1— k 
— K(k = —— where k44,;1 := 1 T ko=0 31.2-10b 
; Kho) 11 ir +1 2 JE. ( ) 
31.2.1.3 Higher order products 
A product of order 4 follows from relation |36.4-6 on page 701 
2 
i r1 2 1— YT k, 
F E 2 | J = |[[ P| where 5,-—À > Mi lig) 
1 E 1+ Y1— kn 2 
ko — k, kn+1=[MnPrl!, ko=0 (31.2-11b) 
Another quartic product is 
1/2 


where W,=vV1=kn, Qu 41- kn = Wn,  (31.2-12a) 


1 1 
DE 
1 


JI 2 
n=0 


2 
RETA 
Qn (1 + Wn) 


This can be obtained by setting a = 1/2 in the following relation [222] rel.22, p.130]: 


m Q(L+W)\~* fa - 1-0 
r( m J = (EA) ee 8Q TW) where (31.2-13a) 


We=vi-z, Q=VW1-z (31.2-13b) 


ko =k, nai =—Rn (1—Qn)*, foo =0 (31.2-12b) 
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Here is a product of order 16 for 2 K(Vk) = AGM (1, vl-— k), compare to relations|31.3-25a|and|31.3-25b 


on page 607 
2 
(2 ? k II : where R, = (Pi — MÀ) ^, (31.2-14a) 
1 aig Pn + En 
1 n 1- n 
de A gie = , M= -— (31.2-14b) 
4 
P m Rn 
ko=k, kn 31.2-14 
0 +1 E ña id ( c) 
Some operations can be saved as follows: 
2 1/4 
1 1 E n (14-82 
F [pn | k) = iga where Ry = (1) , (31.2-15a) 
lcs 1 
n= (1 kn)”, ue 2, Xn = == 1.2-15b 
satn a zop 0 245b) 
ko =k, kna1 =[(Pa — Rn) Xn] (31.2-15c) 
31.2.2 Elliptic E 
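The complete elliptic integral of the second kind can be defined as

\[ E(k) = \int_0^{\pi/2} \sqrt{1 - k^2 \sin^2\vartheta}\; d\vartheta = \int_0^1 \frac{\sqrt{1 - k^2 t^2}}{\sqrt{1 - t^2}}\; dt \tag{31.2-16} \]

We have

\[ E(k) = \frac{\pi}{2}\, F\!\left(-\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, k^2\right) \tag{31.2-17a} \]
\[ = \frac{\pi}{2} \left(1 - \sum_{i=1}^{\infty} \left(\frac{(2i-1)!!}{2^i\, i!}\right)^2 \frac{k^{2i}}{2i - 1}\right) \tag{31.2-17b} \]
\[ = \frac{\pi}{2} \left(1 - \left(\frac{1}{2}\right)^2 \frac{k^2}{1} - \left(\frac{1 \cdot 3}{2 \cdot 4}\right)^2 \frac{k^4}{3} - \left(\frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6}\right)^2 \frac{k^6}{5} - \ldots\right) \tag{31.2-17c} \]
\[ = \frac{\pi}{2} \left(1 - \frac{1}{4} k^2 - \frac{3}{64} k^4 - \frac{5}{256} k^6 - \frac{175}{16384} k^8 - \frac{441}{65536} k^{10} - \ldots\right) \tag{31.2-17d} \]

Special values are $E(0) = \pi/2$ and $E(1) = 1$. The latter leads to a (slowly converging) series for $2/\pi$:

\[ \frac{2}{\pi} = 1 - \sum_{i=1}^{\infty} \left(\frac{(2i-1)!!}{2^i\, i!}\right)^2 \frac{1}{2i - 1} \tag{31.2-18} \]

Similarly as for $K'$, one defines $E'$ as

\[ E'(k) := E(k') = E\!\left(\sqrt{1 - k^2}\right) \tag{31.2-19} \]

The key to fast computation of $E$ is the relation

\[ \frac{E}{K} = 1 - \sum_{n=0}^{\infty} 2^{n-1}\, c_n^2 \tag{31.2-20} \]

The terms $c_n^2$ in the sum occur naturally during the computation of the AGM, see relation 31.1-2 on
page 599. One defines

\[ R := \frac{E}{K}, \qquad R' := \frac{E'}{K'} \tag{31.2-21} \]

Then $E$ can be computed via

\[ E(k) = R(k)\, K(k) = \frac{\pi}{2\, \mathrm{AGM}(1, \sqrt{1 - k^2})} \left(1 - \sum_{n=0}^{\infty} 2^{n-1}\, c_n^2\right) \tag{31.2-22} \]

Legendre's relation between $K$ and $E$ is (arguments omitted for readability, choose your favorite form):

\[ \frac{E}{K} + \frac{E'}{K'} - 1 = \frac{\pi}{2\, K\, K'}, \qquad E\, K' + E'\, K - K\, K' = \frac{\pi}{2} \tag{31.2-23} \]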
Equivalently, 
E/K R 
AGM(1,k) = / = (31.2-24) 


(1 — E’/K’) (1 — R’) 
For $k = 1/\sqrt{2} =: s$ we have $k = k'$, thereby $K = K'$ and $E = E'$, so

\[ \frac{K(s)}{\pi} \left(\frac{2\, E(s)}{\pi} - \frac{K(s)}{\pi}\right) = \frac{1}{2 \pi} \tag{31.2-25} \]

As expressions 31.2-5a and 31.2-22 provide a fast AGM-based computation of $K$ and $E$ the above formula
can be used to compute π.


The following expressions for the derivatives of $K$, $K'$, $E$, and $E'$ allow fast computation (see [124, p.75]):

\[ \frac{dK}{dk} = \frac{E - k'^2 K}{k\, k'^2}, \qquad \frac{dK'}{dk} = \frac{k^2 K' - E'}{k\, k'^2} \tag{31.2-26a} \]
\[ \frac{dE}{dk} = \frac{E - K}{k}, \qquad \frac{dE'}{dk} = \frac{k\, (K' - E')}{k'^2} \tag{31.2-26b} \]
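
The AGM-based computation of $K$ and $E$ (relations 31.2-5a, 31.2-20 and 31.2-22) can be sketched as follows. The code is a double precision illustration with made-up names (`KE`, `elliptic_ke`); it assumes the AGM is started with $a_0 = 1$, $b_0 = k'$, so that $c_0 = k$.

```cpp
#include <cassert>
#include <cmath>

struct KE { double K; double E; };

// K(k) and E(k) from a single AGM run, 0 <= k < 1.
KE elliptic_ke(double k)
{
    assert(k >= 0.0 && k < 1.0);
    const double pi = std::acos(-1.0);
    double a = 1.0;
    double b = std::sqrt((1.0 - k) * (1.0 + k));  // b_0 = k'
    double pow2 = 0.5;                 // 2^(n-1) for n = 0
    double sum = pow2 * k * k;         // term for n = 0 (c_0 = k)
    for (int n = 1; n <= 60; ++n)
    {
        const double c = 0.5 * (a - b);   // c_n
        const double t = 0.5 * (a + b);   // a_n
        b = std::sqrt(a * b);             // b_n
        a = t;
        pow2 *= 2.0;
        sum += pow2 * c * c;
        if (c <= 1e-17)  break;           // converged in double precision
    }
    const double K = pi / (2.0 * a);
    return { K, (1.0 - sum) * K };        // E = R K
}
```

Legendre's relation offers a convenient consistency check: evaluating the routine at $k$ and at $k'$ gives $K$, $E$, $K'$, $E'$, and $E K' + E' K - K K'$ should equal $\pi/2$ up to rounding.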
31.3 Theta functions, eta functions, and singular values 
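The theta functions $\Theta_2$, $\Theta_3$, and $\Theta_4$ are defined as

\[ \Theta_2(q) = \sum_{n=-\infty}^{\infty} q^{(n+1/2)^2} = 2 \sum_{n=0}^{\infty} q^{(n+1/2)^2} = 2\, q^{1/4} \sum_{n=0}^{\infty} q^{n^2+n} \tag{31.3-1a} \]
\[ \Theta_3(q) = \sum_{n=-\infty}^{\infty} q^{n^2} = 1 + 2 \sum_{n=1}^{\infty} q^{n^2} \tag{31.3-1b} \]
\[ \Theta_4(q) = \sum_{n=-\infty}^{\infty} (-1)^n\, q^{n^2} = 1 + 2 \sum_{n=1}^{\infty} (-1)^n\, q^{n^2} \tag{31.3-1c} \]

These are the expressions for $z = 0$ of the more general form in two variables, see [1, sect.16.27, p.576].
The following relations hold for the theta functions:

\[ \Theta_3^2(q^2) = \frac{\Theta_3^2(q) + \Theta_4^2(q)}{2} \tag{31.3-2a} \]
\[ \Theta_4^2(q^2) = \sqrt{\Theta_3^2(q)\, \Theta_4^2(q)} = \Theta_3(q)\, \Theta_4(q) \tag{31.3-2b} \]
\[ \Theta_2^2(q^2) = \frac{\Theta_3^2(q) - \Theta_4^2(q)}{2} \tag{31.3-2c} \]

Using the first two relations (and $\lim_{n\to\infty} \Theta_3(q^n) = \lim_{n\to\infty} \Theta_4(q^n) = 1$ for $|q| < 1$) we see that
AGM$(\Theta_3^2(q), \Theta_4^2(q))$ = AGM$(\Theta_3^2(q^2), \Theta_4^2(q^2))$ = AGM$(\Theta_3^2(q^4), \Theta_4^2(q^4))$ = AGM$(\Theta_3^2(q^8), \Theta_4^2(q^8))$ = ..., so

\[ 1 = \mathrm{AGM}\!\left(\Theta_3^2(q), \Theta_4^2(q)\right) \tag{31.3-3} \]

By the linearity of the AGM in both arguments we have

\[ \frac{1}{\Theta_3^2(q)} = \mathrm{AGM}\!\left(1, \frac{\Theta_4^2(q)}{\Theta_3^2(q)}\right) \tag{31.3-4} \]

The relation can be identified with $\frac{\pi}{2K} = \mathrm{AGM}(1, k')$ if $k$ and $q$ are connected via

\[ q = \exp\!\left(-\pi\, \frac{K'}{K}\right) \tag{31.3-5} \]

The relations between the theta functions and $K$, $k$, $k'$ are

\[ \Theta_2^2(q) = \frac{2\, k\, K}{\pi}, \qquad \Theta_3^2(q) = \frac{2\, K}{\pi}, \qquad \Theta_4^2(q) = \frac{2\, k'\, K}{\pi} \tag{31.3-6a} \]

We also have $\log(1/q) = -\log q = \pi\, K'/K$, so

\[ \frac{\pi}{\log(1/q)} = -\frac{\pi}{\log(q)} = \mathrm{AGM}\!\left(\Theta_3^2(q), \Theta_2^2(q)\right) = \frac{\mathrm{AGM}(1, k)}{\mathrm{AGM}(1, k')} \tag{31.3-7} \]

Identifying $a_k = \Theta_3^2(q)$ and $b_k = \Theta_4^2(q)$ in relations 31.1-1a and 31.1-1b on page 599 we see that $a_{k+1} =
\Theta_3^2(q^2)$ (via relation 31.3-2a) and $b_{k+1} = \Theta_4^2(q^2)$ (via relation 31.3-2b). That is, the AGM sends $q$ to $q^2$.


We can also identify $c_k = \Theta_2^2(q)$, then $c_{k+1} = \Theta_2^2(q^2)$.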
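
The statement that one AGM step sends $q$ to $q^2$ can be checked numerically. The sketch below sums the defining series 31.3-1a to 31.3-1c directly in double precision; the struct `Theta` and the function `theta` are names invented for this example.

```cpp
#include <cassert>
#include <cmath>

struct Theta { double t2, t3, t4; };

// Theta_2, Theta_3, Theta_4 by direct summation, 0 <= q < 1.
Theta theta(double q)
{
    assert(q >= 0.0 && q < 1.0);
    double s2 = 0.0, s3 = 0.0, s4 = 0.0;
    for (int n = 0; n < 1000; ++n)
    {
        const double e2 = std::pow(q, n * (n + 1.0));   // q^(n^2+n)
        s2 += e2;
        if (n > 0)
        {
            const double e3 = std::pow(q, static_cast<double>(n) * n);  // q^(n^2)
            s3 += e3;
            s4 += (n % 2 ? -e3 : e3);
        }
        if (e2 < 1e-18)  break;   // all remaining terms are smaller still
    }
    return { 2.0 * std::pow(q, 0.25) * s2, 1.0 + 2.0 * s3, 1.0 + 2.0 * s4 };
}
```

With `a = theta(q)` and `b = theta(q*q)` the three relations 31.3-2a to 31.3-2c can be compared term by term; the series converge very quickly unless $q$ is close to 1.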


31.3.1 Relations for the theta functions 
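We give identities involving fourth powers, the first is Jacobi's identity, it follows from 31.1-2:

\[ \Theta_3^4(q) = \Theta_4^4(q) + \Theta_2^4(q) \tag{31.3-8a} \]
\[ \frac{\Theta_3^4(q) + \Theta_4^4(q)}{2} = \Theta_3^4(q^2) + \Theta_2^4(q^2) = 2\, \Theta_3^4(q^2) - \Theta_4^4(q^2) = \Theta_4^4(q^2) + 2\, \Theta_2^4(q^2) \tag{31.3-8b} \]

Now identify $\alpha_k = \Theta_3(q)$ and $\alpha_{k+1} = \Theta_3(q^4)$ in relation 31.1-6b on page 600 and $\beta_k = \Theta_4(q)$ and $\beta_{k+1} =
\Theta_4(q^4)$ in 31.1-6c. The 4th order variant of the AGM sends the pair $(\Theta_3(q), \Theta_4(q))$ to $(\Theta_3(q^4), \Theta_4(q^4))$:

\[ \Theta_3(q^4) = \frac{\Theta_3(q) + \Theta_4(q)}{2} \tag{31.3-9a} \]
\[ \Theta_4(q^4) = \left(\frac{\Theta_3(q)\, \Theta_4(q)\, \left(\Theta_3^2(q) + \Theta_4^2(q)\right)}{2}\right)^{1/4} = \sqrt{\Theta_3(q^2)\, \Theta_4(q^2)} \tag{31.3-9b} \]

We also have $\gamma_k = \Theta_2(q)$, $\gamma_{k+1} = \Theta_2(q^4)$, and

\[ \Theta_2(q^4) = \frac{\Theta_3(q) - \Theta_4(q)}{2} \tag{31.3-10} \]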
The following expressions can be verified using relations [31.3-2a] and [31.3-9a] 
3204(4) = [Os(4) + Oa(@)I* [05(a) + ©4(@)] Os(a) Oa(a) (31.3-11a) 


1604(4*) = 1603(47) — O3(4) (31.3-11b) 
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Repeated application of the identities 


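\[ \Theta_3(q) = \Theta_3(q^4) + \Theta_2(q^4) \tag{31.3-12a} \]
\[ \Theta_4(q) = \Theta_3(q^4) - \Theta_2(q^4) \tag{31.3-12b} \]

gives

\[ \Theta_3(q) = +\Theta_2(q^4) + \Theta_2(q^{16}) + \Theta_2(q^{64}) + \ldots + \Theta_2(q^{4^N}) + \Theta_3(q^{4^N}) \tag{31.3-13a} \]
\[ \Theta_4(q) = -\Theta_2(q^4) + \Theta_2(q^{16}) + \Theta_2(q^{64}) + \ldots + \Theta_2(q^{4^N}) + \Theta_3(q^{4^N}) \tag{31.3-13b} \]

These are also valid in the limit $N \to \infty$ ($\lim_{n\to\infty} \Theta_3(q^n) = 1$, so):

\[ \Theta_3(q) = 1 + \Theta_2(q^4) + \Theta_2(q^{16}) + \Theta_2(q^{64}) + \Theta_2(q^{256}) + \ldots \tag{31.3-14a} \]
\[ \Theta_4(q) = 1 - \Theta_2(q^4) + \Theta_2(q^{16}) + \Theta_2(q^{64}) + \Theta_2(q^{256}) + \ldots \tag{31.3-14b} \]

From [206, p.35, ex.6], attributed to Gauss:

\[ \Theta_3^4(q) = 1 + \frac{1}{2}\, \Theta_2^4(q) + \frac{3}{2} \left[\Theta_2^4(q^2) + \Theta_2^4(q^4) + \Theta_2^4(q^8) + \Theta_2^4(q^{16}) + \ldots\right] \tag{31.3-15a} \]
\[ \Theta_4^4(q) = 1 - \frac{1}{2}\, \Theta_2^4(q) + \frac{3}{2} \left[\Theta_2^4(q^2) + \Theta_2^4(q^4) + \Theta_2^4(q^8) + \Theta_2^4(q^{16}) + \ldots\right] \tag{31.3-15b} \]

These can be obtained from the following:

\[ \left[2\, \Theta_3^4(q) - \Theta_2^4(q)\right] = \left[2\, \Theta_3^4(q^2) - \Theta_2^4(q^2)\right] + 3\, \Theta_2^4(q^2) \tag{31.3-16a} \]
\[ \left[2\, \Theta_4^4(q) + \Theta_2^4(q)\right] = \left[2\, \Theta_4^4(q^2) + \Theta_2^4(q^2)\right] + 3\, \Theta_2^4(q^2) \tag{31.3-16b} \]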


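The following relations are consequences of 31.3-9a and 31.3-2a, they are valid only for finite $N$:

\[ \Theta_3(q) = 2^N\, \Theta_3(q^{4^N}) - \left[\Theta_4(q) + 2\, \Theta_4(q^4) + 4\, \Theta_4(q^{16}) + \ldots + 2^{N-1}\, \Theta_4(q^{4^{N-1}})\right] \tag{31.3-17a} \]
\[ \Theta_3^2(q) = 2^N\, \Theta_3^2(q^{2^N}) - \left[\Theta_4^2(q) + 2\, \Theta_4^2(q^2) + 4\, \Theta_4^2(q^4) + \ldots + 2^{N-1}\, \Theta_4^2(q^{2^{N-1}})\right] \tag{31.3-17b} \]

From (a factorization of 31.3-8a)

\[ \Theta_3^2(q) = \Theta_3^2(q^2) + \Theta_2^2(q^2) \tag{31.3-18a} \]
\[ \Theta_4^2(q) = \Theta_3^2(q^2) - \Theta_2^2(q^2) \tag{31.3-18b} \]

we obtain

\[ \Theta_3^2(q) = \Theta_3^2(q^{2^N}) + \left[+\Theta_2^2(q^2) + \Theta_2^2(q^4) + \ldots + \Theta_2^2(q^{2^N})\right] \tag{31.3-19a} \]
\[ = 1 + \Theta_2^2(q^2) + \Theta_2^2(q^4) + \Theta_2^2(q^8) + \Theta_2^2(q^{16}) + \ldots \tag{31.3-19b} \]
\[ \Theta_4^2(q) = \Theta_3^2(q^{2^N}) + \left[-\Theta_2^2(q^2) + \Theta_2^2(q^4) + \ldots + \Theta_2^2(q^{2^N})\right] \tag{31.3-19c} \]
\[ = 1 - \Theta_2^2(q^2) + \Theta_2^2(q^4) + \Theta_2^2(q^8) + \Theta_2^2(q^{16}) + \ldots \tag{31.3-19d} \]

We give two relations from [66, pp.110-112]:

\[ \Theta_4(q)\, \Theta_4(q^3) + \Theta_2(q)\, \Theta_2(q^3) = \Theta_3(q)\, \Theta_3(q^3) \tag{31.3-20a} \]
\[ \sqrt{\Theta_4(q)\, \Theta_4(q^7)} + \sqrt{\Theta_2(q)\, \Theta_2(q^7)} = \sqrt{\Theta_3(q)\, \Theta_3(q^7)} \tag{31.3-20b} \]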
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31.3.2 Corresponding transformations of q and k 


The map q + —q swaps Os with O4 and sends k' — 1/k' and k >> 4 


-ik/k'. The map q+ q? corresponds 


to the maps (compare with relations |31.2-10a| and |31.2-9a on page 602) 


kg e ko EE, wone = OYE (31.3-21) 
For the map q — q^ we have 
Vk ee Vk > l Lv a ote) (31.3-22) 
1+ Vk’ 1+ Vk’ 1+ Vk’ 
the map q > q’® is obtained by applying q — q^ twice. Define 
p — 2369) + Calg) gg = St) > 9«(q) (31.3-23a) 
R = YPM = (9x9 O4(4) a + ei(a)] (31.3-23b) 
Then we have, in terms of the theta functions, 
eyq*) = E > i (31.3-24a) 
Os(q^) = LLL (31.3-24b) 
SAUS) i PRI (31.3-24c) 


Adding the first two relations gives O3(q1%) + O5(q!6) = 3 [O3(q) + O4(q)]. The maps for k and k’ are 


— ps Mt 
Vk = ean where P=1+Vk', M=1-vk' (31.3-25a) 
VEPR(P2+R? " 
Vk i E e e V P4 — M^ (31.3-25b) 
31.3.3 Relations for the eta functions 
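We define the eta functions $\eta$ and $\eta_+$ as

\[ \eta(q) := \prod_{n=1}^{\infty} \left(1 - q^n\right), \qquad \eta_+(q) := \prod_{n=1}^{\infty} \left(1 + q^n\right) = \frac{\eta(q^2)}{\eta(q)} \tag{31.3-26} \]

Note that $\eta$ is not the Dedekind eta function, which is $q^{1/24} \prod_{n=1}^{\infty} (1 - q^n)$, see entry "Dedekind eta
function"]. We write $E_k$ for $\eta(q^k)$ where convenient. Expressions for the theta functions are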


6.4) = 2977 = 2440) = VR) Gama) 
Ej _ w(-a9 _ ny na) i 
= HE Mé) na) ma) poe 
_F_r@_ @ — ng) gag, : 
O49) = E, - «qm uo "9 SES) 
291 (a?) = ©2(q) O3(q) O4(q) = O2(4) Ola”) (31.3-27d) 




Expressions for K, k, and k' = v1 — k? are 


2K E 2k K 2. /q Ej Ak K El 
AE er Esc — @2(o) = 4 — o/a — fA 
- a(q) EDED - 2(q) EZ > = 4() E 
O3(a) E EQ Ej Ej na)” 
k mm 2 m 4 q 4 E 4 q 1 4 4 q + 
eto VTA Vi pp = VT (ug) 
4 4 
y 2 C8) _ ES BEC En | _ 2] 
©3(4) Ej EY n(—q) 14 (q) 
The first equalities of the following six relations are taken from [356] p.488]: 
= -— 256 k? kS K1? 24 24 
Ha- = gg 7 (q) = Ei 
II e m pane z 16q k'* 1 u n” (q) u ED 
i RG) PC) LE 
oo "^ 16 k^ ka K”? " 4 
II (1-4?) = geal = 144) = Ez 
n=1 
oo " k2 n24 q? pI” 
[[a+e")* = a = (q) = x 2m 
zu 16 qk nt (q) Ex 
oo 24 
Were E ud e A 
n=1 k? k”? "i (a?) E Ea 
= kt E, 
1 2n 24 — 24/42) — | 4 
11 ( +4 ) 256 q? L2 n4 (q ) Es 
Replacing q — — q, k? — 1/k?, and k? > — k?/k”? in|31.3-29d] gives 
Ia«com* = £9 - y = LO - CORO - E Es 
m 16q i "n2 (a) n8 (a?) Ej 
Multiplying |31.3-29c| and [31.3-29f] gives 
Y An\24 k8 k? K”? | „247 Ay p24 
II (1-q ) W6qn2 = n (g) = Ej 
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(31.3-28a) 
(31.3-28b) 


(31.3-28c) 


(31.3-29a) 
(31.3-29b) 
(31.3-29c) 
(31.3-29d) 
(31.3-29e) 


(31.3-29f) 
| Ñ (31.3-29g) 


(31.3-29h) 


Most of the following identities can be found by rewriting other relations (typically involving the theta 
functions or k) in terms of the eta function. 


k 


VE 


4 yq E2 E$ Ef 
EP +4q BA ES 
EP — 4q E4 E$ 
EP +4q BS ES 
ES = 2q Ej Efe 
E$ + 2q Ej Efe 


(31.3-30a) 
(31.3-30b) 


(31.3-30c) 


For the numerators and denominators in the last two equations we have (see section [37.2.4|on page 


for more relations of this type) 


12 4 4 p2 p2 p4 12 4 E3* E; 
Ey —AqEjEg = Ej} E3 Ej Es, Ej +4qE¿ Ez = ELE 
1 £4 
E? E? Es E? EX Es E 
EPOMEPEM = Dg o BB +2qEi Big = Ug. 
1 


(31.3-31a) 


(31.3-31b) 
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The first identity can be obtained by multiplying the last two and replacing q? by q. Multiplying the 
first two relations gives a relation equivalent to k? + k'? = 1, we give several forms: 


E% —16q E$ E} = EFE (31.3-32a) 
m(-q)-w(q) = 16qm (q^) (31.3-32b) 
nta) -"(-« = 16990 (47) = 1699 (a) n? (-a) (31.3-32c) 
nf. (P) +16q9 (P) $. (a) = net) (31.3-32d) 
The identity is also the product of the following two relations: 
2E En 
E} — E E] = BMEJEZE¡ES, 2)°+ ES Et = ==> (31.3-33) 
Ez Eg 
Their ratio is 
Ek? — EE 4q Eż ES 
— = = Ha) (31.3-34) 
Ej + Ey Ez Ej 
'The numerator is the product of the first two of the following identities: 
ia 2 E? E, EŠ 4 r2 4q E? E, E? El 
E$ + EE? = >, E$ — Ei E? = AAA (31.3-35a) 
Eg Es 
Ej — Ej Ej 2q Ej Eig 
= = /k(q4 31.3-35b 
BS + El E? ES (^) ( ) 
We rewrite the first two relations, and multiply to obtain an expression for k'(q?): 
2 E? ES Es El E, El, 4q E? E} E, E2 Eg 
6 4p2 — 5 > 6 4p2 — 2 (31.3-36a) 
Ey + ELE; Es Ez — Ej By Ey Eig 
2 Et ES E? ES Ef 
= = k(g 31.3-36b 
EP + ES EÀ EP (1 ( ) 
31.3.4 Singular values 
  n:  k_n                          minpoly( k_n^2 )
  1:  0.7071067811865475244008444  2*x - 1
  2:  0.4142135623730950488016887  x^2 - 6*x + 1
  3:  0.2588190451025207623488988  16*x^2 - 16*x + 1
  4:  0.1715728752538099023966226  x^2 - 34*x + 1
  5:  0.1188769458026001011927468  16*x^4 - 32*x^3 + 88*x^2 - 72*x + 1
  6:  0.0851642331747425876487993  x^4 - 140*x^3 + 294*x^2 - 140*x + 1
  7:  0.0626229125431679701026646  256*x^2 - 256*x + 1
  8:  0.0470218995009911016709143  x^4 - 452*x^3 - 122*x^2 - 452*x + 1
  9:  0.0359215682038989341106255  16*x^4 - 32*x^3 + 792*x^2 - 776*x + 1
 10:  0.0278424447445495031701183  x^4 - 1292*x^3 + 2598*x^2 - 1292*x + 1

Figure 31.3-A: Singular values $k_n$ for $n \le 10$ and minimal polynomials of $k_n^2$.

For every $n \in \mathbb{N}_+$ there is a unique $k_n$ ($0 < k_n < 1$) such that

\[ \frac{K'(k_n)}{K(k_n)} = \sqrt{n} \tag{31.3-37} \]

The value of $k_n$ can be computed by setting $q = \exp(-\pi \sqrt{n})$ in relation 31.3-28b or by solving
AGM$(1, \sqrt{1 - k_n^2})$/AGM$(1, k_n) = \sqrt{n}$. The values $k_n$ are algebraic (over $\mathbb{Q}$). A few examples are
(find more values in [349, entry "Elliptic Lambda Function"]):

\[ k_1 = \frac{1}{\sqrt{2}} \tag{31.3-38a} \]
\[ k_2 = \sqrt{2} - 1 \tag{31.3-38b} \]
\[ k_3 = \frac{\sqrt{6} - \sqrt{2}}{4} = \frac{1}{2} \sqrt{2 - \sqrt{3}} \tag{31.3-38c} \]
\[ k_4 = 3 - 2 \sqrt{2} \tag{31.3-38d} \]
\[ k_5 = \sqrt{\frac{1}{2} - \sqrt{\sqrt{5} - 2}} = \frac{1}{2} \left(\sqrt{\sqrt{5} - 1} - \sqrt{3 - \sqrt{5}}\right) \tag{31.3-38e} \]
\[ k_6 = \left(2 - \sqrt{3}\right) \left(\sqrt{3} - \sqrt{2}\right) \tag{31.3-38f} \]
\[ k_7 = \frac{\sqrt{2}}{8} \left(3 - \sqrt{7}\right) = \frac{1}{4} \sqrt{8 - 3 \sqrt{7}} \tag{31.3-38g} \]

The degree of the minimal polynomial of $k_n^2$ can (for $n \ge 2$) be computed in the GP language as
2*qfbclassno(-4*n). The sequence of the degrees of the minimal polynomials for $k_n^2$ starts as

   n:   1  2  3  4  5  6  7  8  9 10 ...
 deg:   1  2  2  2  4  4  2  4  4  4 ...


This is twice entry A000003 in [312]. The first few minimal polynomials are shown in figure 31.3-A. If
the degree of the minimal polynomial is small, one sometimes can solve it in radicals even for large $n$.
For example, with $n = 163$ we have

1 2U? 
kis = TE where (31.3-39) 


U = 80040, r= if + 557403 V3 - 163 


and $k_{163} \approx 7.80664428497433 \cdot 10^{-9}$. The quantity

\[ P_n := -\frac{2}{\sqrt{n}}\, \log\frac{k_n}{4} \tag{31.3-40} \]

is an approximation for π. For example, $\pi - P_{163} \approx -2.38 \cdot 10^{-18}$. Better approximations are given by

\[ P_{n,j} = -\frac{1}{\sqrt{n}}\, \log\left(w_n + 8\, w_n^2 + 84\, w_n^3 + 992\, w_n^4 + \ldots + c_j\, w_n^j\right) \tag{31.3-41} \]

where $w_n = k_n^2/16$ and the coefficients $c_j$ are given as entry A005797 in [312]:
1, 8, 84, 992, 12514, 164688, 2232200, 30920128, 435506703, 6215660600, 89668182220, ...


For example, T — Par = —2.15- 1071? and r — P634 x — 2.06 - 10767. 


Certain values of the gamma function can be expressed in terms of evaluations of K at singular values 
(taken from [339] p.12]): 


T(1/3) = w'/897/93-1/12 K (g, 4 (31.3-42a) 

T(1/4) = w/42K (k) (31.3-42b) 

T(1/8) = «1/8218  (k,)/* K (kg)? (31.3-42c) 
1/4 

T(1/24) = 1/24 980/36 325/45 Y V3 41 (V3 — 1) K (kj) ^ K (ky)? K (ks)? — (31.3-42d) 
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31.4 AGM-type algorithms for hypergeometric functions 


We give AGM-based algorithms for the hypergeometric functions 


a 1 1 
F ¢ E ve :) where s€ fo, a 5) (31.4-1a) 


1 
5 qp 
1 1 
i ti+! 1 1 
4 Lon e —,- 1.4-1 
r( l D where TIPS (3 b) 


These are taken from [69] and [151], both papers are recommended for further studies. See also [241], [68], 
[97], and [96]. The limit of a three-term iteration as a generalized hypergeometric function is determined 
in [217]. A four-term iteration is considered in [67]. 

The following transformations can be applied to the functions, these are special cases of relations|36.2-12a 
and |36.2-12b on page 690 


1 1_ lis 1_8 1 
F [* i si :) = F (* E 2 Az(1— 3) where |z| « a (31.4-2a) 
1 1_ lj l. —AJdez 
p(3*53 “la sy a aa a (31.4-2b) 
1 1 2 
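31.4.1 Algorithms for $F\!\left(\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, z\right)$ [1/2 ± 0]

The following is relation 31.2-5a on page 601, the classical AGM algorithm which has quadratic convergence:

\[ F\!\left(\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, z\right) = 1/M\!\left(1, \sqrt{1 - z}\right) \quad \text{where} \tag{31.4-3a} \]
\[ M(a, b) := \left[(a + b)/2,\; \sqrt{a\, b}\right] \tag{31.4-3b} \]


We write the AGM as $M := [f(a, b),\, g(a, b)]$ in the obvious way.


Compare to the following hypergeometric transformation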
Compare to the following hypergeometric transformation 


i 1 11 
ES 2) = a+ r (3? 2 (31.4-4a) 
where 
1-(1-2')\/2 i ta 
z un A E =1- 1.4-4b 
I«x(i-345j02*  * IFz 8 ) 


It is the special case a = 1/2 and b = 1/2 of the transformation 
a, b 4z a, b 1-zM a,ü—b4-1 
P(o FU h-E) aar” 2/22) (31.45 
e ax] v G=) ta | bei d^ (ele) 
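A fourth order algorithm is found by combining two steps of the classical AGM:

\[ F\!\left(\tfrac{1}{2}, \tfrac{1}{2};\, 1 \,\middle|\, z\right) = 1/M\!\left(1, \sqrt[4]{1 - z}\right)^2 \quad \text{where} \tag{31.4-6a} \]
\[ M(a, b) := \left[(a + b)/2,\; \left(a\, b\, (a^2 + b^2)/2\right)^{1/4}\right] \tag{31.4-6b} \]

31.4.2 Algorithms for $F\!\left(\tfrac{1}{3}, \tfrac{2}{3};\, 1 \,\middle|\, z\right)$ [1/2 ± 1/6]


A third order algorithm:

\[ F\!\left(\tfrac{1}{3}, \tfrac{1}{3};\, 1 \,\middle|\, z\right) = 1/M\!\left(\sqrt[3]{1 - z}, 1\right) = \sqrt[3]{1 - z}\;\, F\!\left(\tfrac{2}{3}, \tfrac{2}{3};\, 1 \,\middle|\, z\right) \quad \text{where} \tag{31.4-7a} \]
\[ M(a, b) := \left[(a + 2 b)/3,\; \sqrt[3]{b\, (a^2 + a b + b^2)/3}\right] \tag{31.4-7b} \]

We further have

\[ F\!\left(\tfrac{1}{3}, \tfrac{2}{3};\, 1 \,\middle|\, z\right) = 1/M\!\left(1, \sqrt[3]{1 - z}\right) \tag{31.4-8} \]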
A quadratic algorithm: 
1 2 
F pr | 2) = 1/M (1, Y1—2z) where (31.4-9a) 
1 1 
M(a,b) :— ls (V2»- a3 + V/2m — as) a (p + 2] and (31.4-9b) 
p +t, m := b? —t, t := ydó—a3b3 (31.4-9c) 
And again (see relation |31.4-7a): 
1 1 2 2 
is 2) = 1/M(YI=2,1) = vr ($, 3 :) (31.4-10) 
We note the following hypergeometric transformation due to Ramanujan: 
1 2 1 2 
ke 2) = atzar (+ 5) (31.4-11a) 
where 
[eti xus f tag? 
= — — —1- 1.4-11 
í 1-2ü-2)/$9 f 1422 Ee tp) 
The general form is given in A 
oett (12 Y "c 
r( " h- (3) = (1422) F seas | (31.4-12) 


For c = 1/3 we obtain relation |31.4-11a| A computer algebra proof that relation |31.4-12| is the only 
n[B1-4-11a| 


possible generalization of relation |31.4-11a|is given in [216]. 


An alternative quadratic algorithm is 


i d 
gs 2) = 1/M(1,W) where (31.4-13a) 
(a,b) :— [(a+0)/2, (3 vor (b+ 2a)/ 3-b) / 2] and (31.4-13b) 
= 2 1/3 
W = = R := | "LESEN , uv: 1—27 — (31.4-13c) 


It is given in the form 


r(*5a-aaezm) = uan (31.414) 
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A product form can be derived from [47] Theorem 6.1]: let 


2(3+ 2)? 
= —_ 1.4-1 
a(z) 202) (3 5a) 
2 1/3 
p(z) :— pne where r:= [22 +222 — z — 1] (31.4-15b) 
r 


then, with to := 1 — z and ty44 := alp(tz)), 
33 14 pt) ] 
33 - 
F ( : | :) = II : (31.4-15c) 


The function p(z) is the real solution of p(@(z)) = z where B(z) := (z?(3 + z))/4. 


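31.4.3 Algorithms for $F\!\left(\tfrac{1}{4}, \tfrac{3}{4};\, 1 \,\middle|\, z\right)$ [1/2 ± 1/4]


A quadratic algorithm:

\[ F\!\left(\tfrac{1}{4}, \tfrac{3}{4};\, 1 \,\middle|\, z\right) = 1/M\!\left(1, \sqrt{1 - z}\right)^{1/2} \quad \text{where} \tag{31.4-16a} \]
\[ M(a, b) := \left[(a + 3 b)/4,\; \sqrt{b\, (a + b)/2}\right] \tag{31.4-16b} \]

We further have (note the swapped arguments in the mean)

\[ F\!\left(\tfrac{1}{4}, \tfrac{1}{4};\, 1 \,\middle|\, z\right) = 1/M\!\left(\sqrt{1 - z}, 1\right)^{1/2} = \sqrt{1 - z}\;\, F\!\left(\tfrac{3}{4}, \tfrac{3}{4};\, 1 \,\middle|\, z\right) \tag{31.4-17} \]

Now set $A_k := \sqrt{(a_k + b_k)/2}$ and $B_k := \sqrt{b_k}$, then

\[ A_{k+1} = \frac{1}{2}\, (A_k + B_k) \tag{31.4-18a} \]
\[ B_{k+1} = \sqrt{A_k\, B_k} \tag{31.4-18b} \]

This is the iteration of the classical AGM:

\[ M(a, b)^{1/2} = \mathrm{AGM}\!\left(\sqrt{\frac{a + b}{2}},\; \sqrt{b}\right) \tag{31.4-19} \]

Equivalently, we have

\[ F\!\left(\tfrac{1}{4}, \tfrac{3}{4};\, 1 \,\middle|\, z\right) = 1/\mathrm{AGM}\!\left(\sqrt{\frac{1 + \sqrt{1 - z}}{2}},\; \sqrt[4]{1 - z}\right) \tag{31.4-20a} \]
\[ F\!\left(\tfrac{1}{4}, \tfrac{1}{4};\, 1 \,\middle|\, z\right) = 1/\mathrm{AGM}\!\left(\sqrt{\frac{1 + \sqrt{1 - z}}{2}},\; 1\right) \tag{31.4-20b} \]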
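
As a plausibility check, the AGM form for $F(\tfrac14, \tfrac34; 1 \mid z)$ (AGM of $\sqrt{(1+\sqrt{1-z})/2}$ and $\sqrt[4]{1-z}$) can be compared against direct summation of the hypergeometric series. The following double precision sketch uses made-up names (`hyp2f1_series`, `f_quarter_agm`) and restricts $z$ to $0 \le z < 1$.

```cpp
#include <cassert>
#include <cmath>

// Direct summation of the hypergeometric series F(a, b; c | z), |z| < 1.
double hyp2f1_series(double a, double b, double c, double z)
{
    assert(std::fabs(z) < 1.0);
    double term = 1.0, sum = 1.0;
    for (int n = 0; n < 100000; ++n)
    {
        term *= (a + n) * (b + n) / ((c + n) * (n + 1.0)) * z;
        sum += term;
        if (std::fabs(term) < 1e-18 * std::fabs(sum))  break;
    }
    return sum;
}

// F(1/4, 3/4; 1 | z) = 1 / AGM( sqrt((1 + sqrt(1-z))/2), (1-z)^(1/4) )
double f_quarter_agm(double z)
{
    assert(z >= 0.0 && z < 1.0);
    const double w = std::sqrt(1.0 - z);
    double a = std::sqrt(0.5 * (1.0 + w));
    double b = std::sqrt(w);
    for (int n = 0; n < 60 && std::fabs(a - b) > 4e-16 * a; ++n)
    {
        const double t = 0.5 * (a + b);
        b = std::sqrt(a * b);
        a = t;
    }
    return 1.0 / a;
}
```

The series needs dozens of terms for moderate $z$ while the AGM needs only a few steps, which is the point of the AGM-type algorithms in this section.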
Compare to the hypergeometric transformation 
i 3 i 3 
pu 2) = var 2) (31.4-21a) 
where 
_ _ 4f)1/2 
2. 1-(-2) (31.4-21b) 


1+3 (1 = gs 


2 
Y = 1- (= ) (31.4-21c) 
zZ 
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It is the special case d = 1/4 of the transformation 


d,d+i d,d+i 
r( ie J = assa e( is 2) (31.4-22) 
3 6 
Various such transformations and their generalizations are given in |152]. 
31.44 Algorithm for F hi odd z) [1/4 1/12] 
A quadratic algorithm is 
i 
F [s 3 :) = 1/M(1,W)'/? where (31.4-23a) 
M(a,b) := [(a +30) /4, (Va + b) /2| and (31.4-23b) 
2 1/3 
W := m R = 2(2-2)^-2241 (31.4-23c) 
It is given [69] p.515] as 
1 1 2 
E p | ara (b 2)/4)| = 1/M (1, x) (31.4-24) 


Note that R in general has a nonzero imaginary part, but W is real for real z. 


2) [1/4 + 1/6] 


The following algorithm has quadratic convergence, W is defined by relation |31.4-23c 


31.4.5 Algorithm for F Gag 


1 5 
F ( MES :) = 1/M(1,W)” where (31.4-25a) 
M(a,b) := [a + 155)/16, (v^ (a + 3b)/4 + b) /2] (31.4-25b) 
The next relation is the special case a = 1/6 of relation|36.2-20e on page 691 
5 5 1/6 m ia |_—42 
= (-2)V$p([1P | 14-2 
F ( :) (1— z) ( 1 = 5) (3 6) 
The following relations are given in p.17]: 
11 1 1/12 /} 5) 17282 
pt | TA = 169? F| 12? 12 31.4-27 
( 1 3 E EP 1 | (+16) ( a 
it) z 1 2 CHAM db em 1728 z 
F| 3 3) — — = = 3 + 27 F| e | — 31.4-27b 
( 1 5) E epe ] ( 1 (243) crm) ( ) 
ld 1 “ME bdo 1728z(z+16) 
23| ^7 "ME +4 16)3 F| | 14-2 
ij ( 1 2) E Ara | ( 1 | (1624 x) (ode) 


We finally give a curious transformation which follows from [242] (entry N = 5 of table 12 on p.32, 
together with entry N — 2, M — 5 of table 17 on p.43): 


d. 2x 1.5 
MY F ( Br | A) = 2MY*F ( aoe As) where (31.4-28a) 
Mi = 2—42°4+ 2562+ 256, Mg = 2°—42°+162+4+16 (31.4-28b) 


(z—4)2°(z4+1)? An = 1728 (z — 4)? 219 (z 4- 1) 


A = 1728 
L M$ , L M3 


(31.4-28c) 
31.5 Computation of π


We give various iterations for computing 7 with super-linear convergence. The number of full precision 
multiplications (FPM) is an indication of the efficiency of the algorithm. The approximate number of 
FPMs with a computation of m to 4 million decimal digits (using radix 10,000 and 1 million LIMBs) is 
indicated as, for example #FPM=123.4. 


31.5.1 Super-linear iterations for 7 


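AGM, implemented in [hfloat: src/pi/piagm.cc], #FPM=98.4:

    a_0 = 1,  b_0 = \frac{1}{\sqrt{2}}    (31.5-1a)
    a_{k+1} = \frac{a_k + b_k}{2}    (31.5-1b)
    b_{k+1} = \sqrt{a_k b_k}    (31.5-1c)
    p_n = \frac{2 a_{n+1}^2}{1 - \sum_{k=0}^{n} 2^k c_k^2} \to \pi    (31.5-1d)
    \pi - p_n \le \frac{\pi^2 2^{n+4} e^{-\pi 2^{n+1}}}{AGM^2(a_0, b_0)}    (31.5-1e)

where c_k^2 = a_k^2 - b_k^2.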
Convergence is second order. Computing π based on the fourth order AGM (relations 31.1-6a ... 31.1-6d
on page 600) is possible by setting the second argument of the routine (#FPM=149.3 for the quartic
variant). Schönhage's variant of the AGM computation (relations 31.1-5a ... 31.1-5j on page 600) is
implemented in [hfloat: src/pi/piagmsch.cc] (#FPM=78.424).
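The second order AGM scheme can be tried out in a few lines. The following is only a minimal Python sketch using the standard `decimal` module, not the hfloat routine; the function name `pi_agm`, the number of guard digits and the step count are assumptions made for illustration, and c_k^2 is taken as a_k^2 - b_k^2.

```python
from decimal import Decimal, getcontext

def pi_agm(digits=50):
    """Approximate pi with the second order AGM scheme (31.5-1a)..(31.5-1d)."""
    getcontext().prec = digits + 10            # a few guard digits (assumption)
    a = Decimal(1)
    b = 1 / Decimal(2).sqrt()
    denom = Decimal(1)                         # 1 - sum_{k<=n} 2^k c_k^2
    pow2 = 1
    p = Decimal(0)
    for _ in range(digits.bit_length() + 2):   # precision roughly doubles per step
        denom -= pow2 * (a * a - b * b)        # c_k^2 = a_k^2 - b_k^2
        a, b = (a + b) / 2, (a * b).sqrt()     # a_{k+1}, b_{k+1}
        p = 2 * a * a / denom                  # p_k = 2 a_{k+1}^2 / denom
        pow2 *= 2
    return p
```

Calling `pi_agm(50)` runs eight steps at 60 digits of working precision; computing c_k^2 as a difference of squares is good enough here only because of the guard digits.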
The AGM method goes back to Gauss; a facsimile of the entry in his 1809 handbook 6 is given in [19,
p.101]. The entry states that

    \pi = \frac{2 AGM(1, k) AGM(1, k')}{1 - \sum_{j=0}^{\infty} 2^{j-1} (c_j^2 + c_j'^2)}    (31.5-2)

where k' = b_0/a_0 and k = \sqrt{1 - b_0^2/a_0^2} = c_0/a_0. For k = k' = 1/\sqrt{2} one obtains relation 31.5-1d. The
formula also appeared in 1924 in [206, p.39]. The algorithm was rediscovered in 1976 independently by Brent
(reprinted in p.424]) and Salamin [294] (reprinted in p.418]).


AGM variant given in [64], [hfloat: src/pi/piagm3.cc], #FPM=99.5 (#FPM=155.3 for the quartic variant):

    a_0 = 1,  b_0 = \frac{\sqrt{6} + \sqrt{2}}{4}    (31.5-3a)
    p_n = \frac{2 a_{n+1}^2}{\sqrt{3} (1 - \sum_{k=0}^{n} 2^k c_k^2) - 1} \to \pi    (31.5-3b)
    \pi - p_n \le \frac{\sqrt{3} \pi^2 2^{n+4} e^{-\sqrt{3} \pi 2^{n+1}}}{AGM^2(a_0, b_0)}    (31.5-3c)


AGM variant given in [64], [hfloat: src/pi/piagm3.cc], #FPM=108.2 (#FPM=169.5 for the quartic variant):

    a_0 = 1,  b_0 = \frac{\sqrt{6} - \sqrt{2}}{4}    (31.5-4a)
    p_n = \frac{6 a_{n+1}^2}{\sqrt{3} (1 - \sum_{k=0}^{n} 2^k c_k^2) + 1} \to \pi    (31.5-4b)
    \pi - p_n < \frac{\frac{1}{\sqrt{3}} \pi^2 2^{n+4} e^{-\frac{1}{\sqrt{3}} \pi 2^{n+1}}}{AGM(a_0, b_0)^2}    (31.5-4c)

Second order iteration from [66, p.170], [hfloat: src/pi/pi2nd.cc], #FPM=255.7:

    y_0 = \frac{1}{\sqrt{2}},  a_0 = \frac{1}{2}    (31.5-5a)
    y_{k+1} = \frac{1 - (1 - y_k^2)^{1/2}}{1 + (1 - y_k^2)^{1/2}} \to 0+    (31.5-5b)
            = \frac{(1 - y_k^2)^{-1/2} - 1}{(1 - y_k^2)^{-1/2} + 1}    (31.5-5c)
    a_{k+1} = a_k (1 + y_{k+1})^2 - 2^{k+1} y_{k+1} \to \frac{1}{\pi}    (31.5-5d)
    a_k - \pi^{-1} \le 16 \cdot 2^k e^{-2^k \pi}    (31.5-5e)

Relation 31.5-5c shows how to save 1 multiplication per step (see section 29.1 on page 567). A simple
proof of this iteration is given in [173].


Borwein's quartic (fourth order) iteration from [66, p.170], variant r = 4, implemented in [hfloat:
src/pi/pi4th.cc], #FPM=170.5:

    y_0 = \sqrt{2} - 1,  a_0 = 6 - 4 \sqrt{2}    (31.5-6a)
    y_{k+1} = \frac{1 - (1 - y_k^4)^{1/4}}{1 + (1 - y_k^4)^{1/4}} \to 0+    (31.5-6b)
            = \frac{(1 - y_k^4)^{-1/4} - 1}{(1 - y_k^4)^{-1/4} + 1}    (31.5-6c)
    a_{k+1} = a_k (1 + y_{k+1})^4 - 2^{2k+3} y_{k+1} (1 + y_{k+1} + y_{k+1}^2) \to \frac{1}{\pi}    (31.5-6d)
            = a_k ((1 + y_{k+1})^2)^2 - 2^{2k+3} y_{k+1} ((1 + y_{k+1})^2 - y_{k+1})    (31.5-6e)
    0 < a_k - \pi^{-1} \le 16 \cdot 4^k \cdot 2 e^{-4^k 2 \pi}    (31.5-6f)

Identities 31.5-6c and 31.5-6e show how to save operations.
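The quartic iteration (variant r = 4) is short enough to try directly. The sketch below is an illustration in Python with the standard `decimal` module, not the code of pi4th.cc; the name `pi_quartic` and the guard digit count are assumptions. The fourth root is taken as two square roots, and the update of a uses the regrouping of identity 31.5-6e.

```python
from decimal import Decimal, getcontext

def pi_quartic(digits=50, steps=4):
    """Borwein quartic iteration, variant r = 4 (relations 31.5-6a..31.5-6e)."""
    getcontext().prec = digits + 10            # guard digits (assumption)
    sqrt2 = Decimal(2).sqrt()
    y = sqrt2 - 1                              # y_0
    a = 6 - 4 * sqrt2                          # a_0
    for k in range(steps):
        r = (1 - y ** 4).sqrt().sqrt()         # (1 - y_k^4)^(1/4)
        y = (1 - r) / (1 + r)                  # y_{k+1}
        w = (1 + y) ** 2                       # shared subexpression
        a = a * w * w - 2 ** (2 * k + 3) * y * (w - y)   # a_{k+1}
    return 1 / a                               # a_k tends to 1/pi
```

One step already gives about eight correct digits, and each further step roughly quadruples that number; steps beyond the working precision leave the result unchanged because y underflows to zero.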


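Borwein's quartic (fourth order) iteration, variant r = 16, implemented in [hfloat: src/pi/pi4th.cc],
#FPM=164.4:

    y_0 = \frac{1 - 2^{-1/4}}{1 + 2^{-1/4}},  a_0 = \frac{8/\sqrt{2} - 2}{(2^{-1/4} + 1)^4}    (31.5-7a)
    y_{k+1} = \frac{(1 - y_k^4)^{-1/4} - 1}{(1 - y_k^4)^{-1/4} + 1} \to 0+    (31.5-7b)
    a_{k+1} = a_k (1 + y_{k+1})^4 - 2^{2k+4} y_{k+1} (1 + y_{k+1} + y_{k+1}^2) \to \frac{1}{\pi}    (31.5-7c)
    0 < a_k - \pi^{-1} \le 16 \cdot 4^k \cdot 4 e^{-4^k 4 \pi}    (31.5-7d)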

The operation count is unchanged, but this variant gives approximately twice as much precision after the
same number of steps. The general form of the quartic iterations (relations 31.5-6a ..., and 31.5-7a ...)
is given in [66, pp.170ff]:

    y_0 = \sqrt{\lambda^*(r)},  a_0 = \alpha(r)    (31.5-8a)
    y_{k+1} = \frac{(1 - y_k^4)^{-1/4} - 1}{(1 - y_k^4)^{-1/4} + 1} \to 0+    (31.5-8b)
    a_{k+1} = a_k (1 + y_{k+1})^4 - 2^{2k+2} \sqrt{r} y_{k+1} (1 + y_{k+1} + y_{k+1}^2) \to \frac{1}{\pi}    (31.5-8c)
    0 < a_k - \pi^{-1} \le 16 \cdot 4^k \sqrt{r} e^{-4^k \sqrt{r} \pi}    (31.5-8d)

Derived AGM iteration (second order, from [66, pp.46ff]), implemented in [hfloat: src/pi/pideriv.cc],
#FPM=276.2:

    x_0 = \sqrt{2},  p_0 = 2 + \sqrt{2},  y_1 = 2^{1/4}    (31.5-9a)
    x_{k+1} = \frac{1}{2} \left( \sqrt{x_k} + \frac{1}{\sqrt{x_k}} \right)  (k \ge 0)  \to 1+    (31.5-9b)
    y_{k+1} = \frac{y_k \sqrt{x_k} + \frac{1}{\sqrt{x_k}}}{y_k + 1}  (k \ge 1)  \to 1+    (31.5-9c)
    p_k = p_{k-1} \frac{x_k + 1}{y_k + 1}  (k \ge 1)  \to \pi+    (31.5-9d)
    p_k - \pi < 10^{-2^{k+1}}  (k \ge 2)    (31.5-9e)

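Cubic AGM iteration (third order) from [70], implemented in [hfloat: src/pi/picubagm.cc], #FPM=182.7:

    a_0 = 1,  b_0 = \frac{\sqrt{3} - 1}{2}    (31.5-10a)
    a_{n+1} = \frac{a_n + 2 b_n}{3}    (31.5-10b)
    b_{n+1} = \sqrt[3]{\frac{b_n (a_n^2 + a_n b_n + b_n^2)}{3}}    (31.5-10c)
    p_n = \frac{3 a_{n+1}^2}{1 - \sum_{k=0}^{n} 3^k (a_k^2 - a_{k+1}^2)} \to \pi    (31.5-10d)

Quintic (fifth order) iteration from [66, p.310], [hfloat: src/pi/pi5th.cc], #FPM=353.2:

    s_0 = 5 (\sqrt{5} - 2),  a_0 = \frac{1}{2}    (31.5-11a)
    x = \frac{5}{s_n} - 1 \to 4    (31.5-11b)
    y = (x - 1)^2 + 7 \to 16    (31.5-11c)
    z = \left( \frac{x}{2} \left( y + \sqrt{y^2 - 4 x^3} \right) \right)^{1/5} \to 2    (31.5-11d)
    s_{n+1} = \frac{25}{s_n (z + x/z + 1)^2} \to 1    (31.5-11e)
    a_{n+1} = s_n^2 a_n - 5^n \left( \frac{s_n^2 - 5}{2} + \sqrt{s_n (s_n^2 - 2 s_n + 5)} \right) \to \frac{1}{\pi}    (31.5-11f)
    a_n - \frac{1}{\pi} < 16 \cdot 5^n e^{-\pi 5^n}    (31.5-11g)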


Cubic (third order) iteration from [30], implemented in [hfloat: src/pi/pi3rd.cc], #FPM=200.3:

    a_0 = \frac{1}{3},  s_0 = \frac{\sqrt{3} - 1}{2}    (31.5-12a)
    r_{k+1} = \frac{3}{1 + 2 (1 - s_k^3)^{1/3}}    (31.5-12b)
    s_{k+1} = \frac{r_{k+1} - 1}{2}    (31.5-12c)
    a_{k+1} = r_{k+1}^2 a_k - 3^k (r_{k+1}^2 - 1) \to \frac{1}{\pi}    (31.5-12d)

Nonic (9th order) iteration from [30], implemented in [hfloat: src/pi/pi9th.cc], #FPM=273.7:

    a_0 = \frac{1}{3},  r_0 = \frac{\sqrt{3} - 1}{2},  s_0 = (1 - r_0^3)^{1/3}    (31.5-13a)
    t = 1 + 2 r_k    (31.5-13b)
    u = (9 r_k (1 + r_k + r_k^2))^{1/3}    (31.5-13c)
    v = t^2 + t u + u^2    (31.5-13d)
    m = \frac{27 (1 + s_k + s_k^2)}{v}    (31.5-13e)
    a_{k+1} = m a_k + 3^{2k-1} (1 - m) \to \frac{1}{\pi}    (31.5-13f)
    s_{k+1} = \frac{(1 - r_k)^3}{(t + 2 u) v}    (31.5-13g)
    r_{k+1} = (1 - s_{k+1}^3)^{1/3}    (31.5-13h)


31.5.2 Measured timings and operation counts 


#FPM - order - routine in hfloat - time
78.424 - 2 - pi_agm_sch() - 76 sec
98.424 - 2 - pi_agm() - 93 sec
99.510 - 2 - pi_agm3(fast variant) - 94 sec
108.241 - 2 - pi_agm3(slow variant) - 103 sec
149.324 - 4 - pi_agm(quartic) - 139 sec
155.265 - 4 - pi_agm3(quartic, fast variant) - 145 sec
164.359 - 4 - pi_4th_order(r=16 variant) - 154 sec
169.544 - 4 - pi_agm3(quartic, slow variant) - 159 sec
170.519 - 4 - pi_4th_order(r=4 variant) - 160 sec
182.710 - 3 - pi_cubic_agm() - 173 sec
200.261 - 3 - pi_3rd_order() - 189 sec
255.699 - 2 - pi_2nd_order() - 240 sec
273.763 - 9 - pi_9th_order() - 256 sec
276.221 - 2 - pi_derived_agm() - 259 sec
353.202 - 5 - pi_5th_order() - 329 sec


Figure 31.5-A: Measured operation counts and timings for various iterations for the computation of π
to 4 million decimal digits.


The operation counts and timings for the algorithms given so far when computing π to 4 million decimal
digits (using 1 million LIMBs and radix 10,000) are shown in figure 31.5-A. In view of these figures it
seems surprising that the quartic algorithms pi_4th_order() and the quartic AGM pi_agm(quartic)
are usually considered close competitors to the second order AGM schemes.


Apart from the operation count the number of variables used has to be taken into account. The algorithms
using more variables (like pi_5th_order()) cannot be used to compute as many digits as those using
only a few (notably the AGM-schemes) given a fixed amount of RAM. Higher order algorithms tend to
require more variables.


A further disadvantage of the algorithms of higher order is the more discontinuous growth of the work:
if just a few more digits are to be computed than are available after step k, then an additional step is
required. Consider an extreme case where an algorithm T of order 1,000 would compute 1 million digits
after the second step, at a slightly lower cost than the most effective competitor. Then algorithm T would
likely be the 'best' one only for small ranges in the number of digits around the values 10^6, 10^9, 10^12, ....


Finally, it is much easier to find special arithmetical optimizations for the 'simple' (low order) algorithms,
Schönhage's AGM variant being the prime example.

31.5.3 More iterations for π


The following iterations are not implemented in hfloat.


Second order algorithm from [71]: 
ago = 


Mn+1 = 


An+1 = 


Implicit second order algorithm from (also in p.700]): 


ao 
(sn)? + (55? 
(1+ 35n41) 1+38;,) 


An+1 


14 y/(4— mn) (2 - m4) 


(1+3 Sn41) An — 2" Sn+1 




(31.5-14a) 
(31.5-14b) 


(31.5-14c) 


(31.5-15a) 
(31.5-15b) 
(31.5-15c) 
) 


(31.5-15d 


It is easy to turn this algorithm into an explicit form as with the next algorithm. However, there exist
iterations that cannot be turned into explicit forms.


Implicit fourth order algorithm from [71] (also in [68] p.700]): 


ao = 1/3 
(sn)*+(s,)* = 1 
(1-435441) (14-355) = 2 


An+1 


Third order algorithm from [63]: 


v = 271/8, 
wo = 1, 00 
Un+1 = v? — {of + 


(1 + 5441)*05 4 


Wn4+1 = 
205,1 — Un (3v 
_ 202 1 
Qn1 = Ü T 
n 


Bn EE (6Wn+1Un V 204 1105) 


31.5-16a 
31.5-16b 
31.5-16c 


ASS 


— 
— Ww a Ha 


31.5-16d 


— 


(31.5-17a) 
(31.5-17b) 


(31.5-17c) 


(31.5-17d) 
(31.5-17e) 
(31.5-17f) 


(31.5-17g)

Combining two steps of the fourth order iteration leads to an algorithm of order 16, from [71, p.108]:

    a_0 = \frac{1}{3},  s_1 = \sqrt{2} - 1    (31.5-18a)
    s_n^* = (1 - s_n^4)^{1/4}    (31.5-18b)
    x_n = \frac{1}{(1 + s_n^*)^4}    (31.5-18c)
    y_n = x_n (1 + s_n)^4    (31.5-18d)
    a_n = 16 y_n a_{n-1} + \frac{4^{2n-1}}{3} [1 - 12 x_n - 4 y_n] \to \frac{1}{\pi}    (31.5-18e)
    t_n = 1 + s_n^*    (31.5-18f)
    u_n = \left( 8 s_n^* (1 + {s_n^*}^2) \right)^{1/4}    (31.5-18g)
    s_{n+1} = \frac{(1 - s_n^*)^4}{(t_n + u_n)^2 (t_n^2 + u_n^2)}    (31.5-18h)
Quadratic iteration by Christian Hoffmann, given in [184, p.5]:

    a_0 = \sqrt{2},  b_0 = 0,  p_0 = 2 + \sqrt{2}    (31.5-19a)
    a_{n+1} = \frac{1}{2} \left( \sqrt{a_n} + \frac{1}{\sqrt{a_n}} \right) \to 1+    (31.5-19b)
    b_{n+1} = \sqrt{a_n} \frac{b_n + 1}{b_n + a_n}    (31.5-19c)
    p_{n+1} = p_n b_{n+1} \frac{1 + a_{n+1}}{1 + b_{n+1}} \to \pi    (31.5-19d)

Note that relation 31.5-19b deviates from the one given in the cited paper which seems to be incorrect.
This is a variant of the iteration given as relations 31.5-9a ... 31.5-9e on page 617. The values p_n are
identical in both iterations.
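That the two iterations produce the same values p_n can be checked numerically. The following Python sketch (double precision only; the function names are made up for this illustration) runs the derived AGM iteration with p_k = p_{k-1} (x_k + 1)/(y_k + 1) next to Hoffmann's form. The two agree because b_n = 1/y_n.

```python
import math

def borwein_derived(steps):
    """p_0..p_steps from x_0 = sqrt(2), p_0 = 2 + sqrt(2), y_1 = 2^(1/4)."""
    x = math.sqrt(2.0)
    p = 2.0 + math.sqrt(2.0)
    y = 2.0 ** 0.25
    out = [p]
    x = 0.5 * (math.sqrt(x) + 1.0 / math.sqrt(x))   # x_1
    p = p * (x + 1.0) / (y + 1.0)                   # p_1
    out.append(p)
    for _ in range(2, steps + 1):
        sx = math.sqrt(x)
        y = (y * sx + 1.0 / sx) / (y + 1.0)         # y_{k+1} from y_k, x_k
        x = 0.5 * (sx + 1.0 / sx)                   # x_{k+1}
        p = p * (x + 1.0) / (y + 1.0)               # p_{k+1}
        out.append(p)
    return out

def hoffmann(steps):
    """p_0..p_steps from a_0 = sqrt(2), b_0 = 0, p_0 = 2 + sqrt(2)."""
    a = math.sqrt(2.0)
    b = 0.0
    p = 2.0 + math.sqrt(2.0)
    out = [p]
    for _ in range(steps):
        sa = math.sqrt(a)
        b = sa * (b + 1.0) / (b + a)                # b_{n+1}, uses a_n
        a = 0.5 * (sa + 1.0 / sa)                   # a_{n+1}
        p = p * b * (1.0 + a) / (1.0 + b)           # p_{n+1}
        out.append(p)
    return out
```

With three steps both lists end at a value that equals π to double precision.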


Cubic iteration given in [65, p.125]:


so = W3-2v3, ap = 1/2 (31.5-20a) 


Mn = 3/8n (31.5-20b) 
2 
anyi = [(s2 - 1) +2] dx (31.5-20c) 
1 
anyi = M2 an — 3" (m? +2m,—3)/2 > E (31.5-20d) 


The cited paper actually gives a more general form, here we take N = 1 for simplicity. 


Cubic iteration given in [97, p.1506, it-1.2]:

    t_0 = \frac{1}{3},  s_0 = \frac{\sqrt{3} - 1}{2}    (31.5-21a)
    s_n = \frac{1 - (1 - s_{n-1}^3)^{1/3}}{1 + 2 (1 - s_{n-1}^3)^{1/3}}    (31.5-21b)
        = \frac{(1 - s_{n-1}^3)^{-1/3} - 1}{(1 - s_{n-1}^3)^{-1/3} + 2}    (31.5-21c)
    t_n = (1 + 2 s_n)^2 t_{n-1} - 3^{n-1} \left( (1 + 2 s_n)^2 - 1 \right) \to \frac{1}{\pi}    (31.5-21d)

Note the corrected denominator in relation 31.5-21b (exponent of s_{n-1} is wrongly given as 2).

Quadratic iteration given in [97, p.1507, it-1.3]:

    k_0 = 0,  s_0 = \frac{1}{\sqrt{2}}
    s_n = \frac{1 - \sqrt{1 - s_{n-1}^2}}{1 + \sqrt{1 - s_{n-1}^2}}
    k_n = (1 + s_n)^2 k_{n-1} + 2^n (1 - s_n) s_n \to \frac{1}{\pi}
Cubic iteration given in [97, p.1507, it-1.4]:
ko = 0, so = 1/V2 
1— @/1—s3_, 
Sn = 
14231-3545 
kn = (1428 Rh  BHe37 V3s br A 
n — n n—1 n qst: 2 Sn = 
Quadratic iteration given in [97] p.1508, it-1.5]: 
ko = 0, Yo = 8/9 
y = 9 692 = 9 Un—1 + v Yn—1(4 = Un—1) 
Dura —6ys-1-1 
n Yn—1 (1 — Yn-1) 1 
kn = 2X3 4 — 3 Yn—1) kn- = 
v3 ü—- ipa + Yn-1) kn > 
Quadratic iteration given in p.1508, it-1.6]: 
ko = 0, Yo = 4/5 
2Y%-1 7 Un-i t 4j Áyn-1— 3y; a 
di Pus 
Qn = VUn- + 1) (4 = 3Yn—1) (ye 4 =3Yn-1 + 4) 
2” Yn-1 (1 — ya—1) 2 1 
kn mE n d 2 n— kn— => = 
V7 2 — Yn-1 Q ( id 1) i T 
Quadratic iteration (as product, two forms) given in [67] p.324]: 
2 1 
Ti = s (v6+2), yi = 5 (v6=+4) 
2 (Vin + ta) 
Ln, = —————— 
+l 1432, 
(0 2yn t Yn/ VEn + y Zn 
Yn+1 = 123%, 
oT = d 
z- 7 II (1+32,) 
Bls (1-394) /4 
_ _ 55346 il (7 1/2;:) 
7 3 14+ 3yn 


n=1 


The definitive source for iterations to compute π and the underlying mathematics is [66].
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(31.5-22a) 


(31.5-22b) 


(31.5-22c) 


(31.5-23a) 


(31.5-23b) 


(31.5-23c) 


(31.5-24a) 


(31.5-24b) 


(31.5-24c) 


(31.5-25a) 
(31.5-25b) 


(31.5-25c) 


(31.5-25d) 


(31.5-26a) 
(31.5-26b) 


(31.5-26c) 


(31.5-26d) 


(31.5-26e)
Chapter 32


Logarithm and exponential function 


We describe algorithms for the computation of the exponential function (and hyperbolic cosine) and 
the logarithm (and inverse tangent). Constructions of super-linear iterations to compute the functions 
from their inverses are given. We also present argument reduction schemes and methods for the fast 
computation of the exponential and logarithm of power series. 


32.1 Logarithm 


32.1.1 AGM-based computation 
The (natural) logarithm can be computed using the relation (see [66, p.221])

    \left| \log(d) - R'(10^{-n}) + R'(10^{-n} d) \right| \le \frac{n}{10^{2(n-1)}}    (32.1-1a)
    \log(d) \approx R'(10^{-n}) - R'(10^{-n} d)    (32.1-1b)

which holds for n ≥ 3 and d ∈ ]1/2, 1[. The first term on the right side is constant and can be saved for
subsequent computations of the logarithm. We use the relation

    \log(M r^X) = \log(M) + X \log(r)    (32.1-2)

where M is the mantissa, r the radix, and X the exponent of the floating-point representation. The value
log(r) is computed only once. If M is not in the interval [1/2, 3/2] an argument reduction is done via

    \log(M) = \log(M s^f) - f \log(s)    (32.1-3)

where 0 < M < 1 for the mantissa M, s = \sqrt{2}, and f ∈ Z so that M s^f ∈ [1/2, 3/2]. The quantity
log(s) = log(\sqrt{2}) can be precomputed directly via the AGM. A C++ implementation is given in [hfloat:
src/tz/log.cc].
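Relation 32.1-1b can be tried in double precision. The sketch below is Python and not the hfloat code; it assumes that R'(k) denotes π/(2 AGM(1, k)), which is the reading under which the relation reproduces log(d). The names `agm`, `r_prime` and `log_agm` and the fixed iteration count are choices made for this illustration.

```python
import math

def agm(a, b, iterations=40):
    """Arithmetic-geometric mean by plain iteration (fixed step count)."""
    for _ in range(iterations):
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a

def r_prime(k):
    # Assumption: R'(k) = pi / (2 AGM(1, k)), which behaves like log(4/k) for small k.
    return math.pi / (2 * agm(1.0, k))

def log_agm(d, n=7):
    """log(d) for d in ]1/2, 1[ as R'(10^-n) - R'(10^-n d), relation 32.1-1b."""
    t = 10.0 ** (-n)
    return r_prime(t) - r_prime(t * d)
```

With n = 7 the bound 32.1-1a is 7/10^12, so `log_agm(0.75)` agrees with the built-in logarithm to about eleven digits; in a multiprecision setting the first term `r_prime(t)` would be computed once and stored.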
There is a nice way to compute the value log(r), the logarithm of the radix, if the value of π has been
precomputed. We need to compute Θ_3(q) and Θ_2(q) where q = 1/r (see section on page 604):

    \Theta_3(q) = 1 + 2 \sum_{n=1}^{\infty} q^{n^2}    (32.1-4)

For the computation of Θ_2(q) we choose q = 1/r^4 =: b^4:

    \Theta_2(q) = 0 + 2 \sum_{n=0}^{\infty} q^{(n+1/2)^2} = 2 \sum_{n=0}^{\infty} b^{4n^2+4n+1}  where q = b^4    (32.1-5a)
                = 2 b \sum_{n=0}^{\infty} q^{n^2+n} = 2 b \left( 1 + \sum_{n=1}^{\infty} q^{n^2+n} \right)    (32.1-5b)

Set q = 1/r, then relation 31.3-7 becomes

    \frac{\pi}{\log(1/q)} = \frac{\pi}{\log(r)} = 4 AGM\left( \Theta_3(q^4)^2, \Theta_2(q^4)^2 \right)    (32.1-6)

Functions to compute Θ_2(b^4), Θ_3(b^4) where b is the inverse fourth power of the used radix r and π/log(r)
are given in [hfloat: src/tz/pilogq.cc].
32.1.2 Computation by inverting the exponential function 
32.1.2.1 Iterations from the power series 
With an efficient algorithm for the exponential function, we can compute the logarithm using

    y := 1 - d e^{-x}    (32.1-7a)
    \log(d) = x + \log(1 - y)    (32.1-7b)
            = x + \log(1 - (1 - d e^{-x})) = x + \log(e^{-x} d) = x + (-x + \log(d))    (32.1-7c)

Expansion of log(1 − y) as power series in y yields

    \log(d) = x + \log(1 - y) = x - \left( y + \frac{y^2}{2} + \frac{y^3}{3} + \frac{y^4}{4} + \ldots \right)    (32.1-8)

Truncation of the series before the n-th power of y gives an iteration of order n:

    x_{k+1} = \Phi_n(x_k) := x - \left( y + \frac{y^2}{2} + \frac{y^3}{3} + \ldots + \frac{y^{n-1}}{n-1} \right)    (32.1-9)
32.1.2.2 Iterations from Padé approximants 
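    1 \mapsto P_{[1,0]} = \frac{z}{1}
    2 \mapsto P_{[1,1]} = \frac{2 z}{2 + z}
    3 \mapsto P_{[2,1]} = \frac{6 z + z^2}{6 + 4 z}
    4 \mapsto P_{[2,2]} = \frac{6 z + 3 z^2}{6 + 6 z + z^2}
    5 \mapsto P_{[3,2]} = \frac{30 z + 21 z^2 + z^3}{30 + 36 z + 9 z^2}
    6 \mapsto P_{[3,3]} = \frac{60 z + 60 z^2 + 11 z^3}{60 + 90 z + 36 z^2 + 3 z^3}


Figure 32.1-A: Padé approximants for log(1 + z).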


The Padé approximants P_{[i,j]}(z) of log(1 − z) at z = 0 produce iterations of order i + j + 1. Compared
to the power series based iteration one needs one additional long division but saves half of the
exponentiations. This can be a substantial saving for high order iterations.


The approximants can be computed via the continued fraction expansion of log(1 + z):

    \log(1 + z) = 0 + \cfrac{c_1 z}{1 + \cfrac{c_2 z}{1 + \cfrac{c_3 z}{1 + \cdots}}}    (32.1-10)

where c_1 = 1 and

    c_k = \frac{k}{4 (k - 1)} if k even,  c_k = \frac{k - 1}{4 k} else    (32.1-11)

Using recurrence relations 37.3-7a and 37.3-7b on page 719 with a_0 = 0, a_k = 1 and b_k = c_k z we find
what is shown in figure 32.1-A. The expressions are Padé approximants correct up to order k. For even k
these are the diagonal approximants [k/2, k/2] which satisfy the functional equation log(1/z) = −log(z):
P(1/z − 1) = −P(z − 1). Further information like the error term of the diagonal approximants is given
in 


The diagonal approximants can be computed by setting P_0 = 0, Q_0 = 1, P_2 = z, Q_2 = 1 + z/2, and
computing, for k = 4, 6, ..., 2n,

    P_k = A_k P_{k-2} + B_k P_{k-4}    (32.1-12a)
    Q_k = A_k Q_{k-2} + B_k Q_{k-4}    (32.1-12b)

(these are relations 37.3-14a and 37.3-14b on page 722). The A_k and B_k are defined as

    A_k = 1 + z/2    (32.1-12c)
    B_k = \frac{z^2}{16} \frac{(k-2)^2}{1 - (k-2)^2}    (32.1-12d)

Then P_{2n}/Q_{2n} is the Padé approximant P_{[n,n]} of log(1 + z) which is correct up to order 2n. The following
GP function implements the algorithm:


log_pade(n, z='z)=
{ /* Return Pade approximant [n,n] of log(1+z) */
    local(P0, Q0, P2, Q2, tp, tq, t);
    if ( n<1, return(0) );
    P0=0;  Q0=1;
    P2=z;  Q2=1+z/2;
    forstep (k=4, 2*n, 2,
        Ak = 1+z/2;  \\ == +z*C(k-1)+z*C(k)+1;
        t = (k-2)^2;
        Bk = z^2/16*t/(1-t);  \\ == -z^2*C(k-1)*C(k-2);
        tp = Ak*P2 + Bk*P0;
        tq = Ak*Q2 + Bk*Q0;
        P0=P2;  P2=tp;
        Q0=Q2;  Q2=tq;
    );
    return( P2/Q2 );
}
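The recurrence for the diagonal approximants can also be checked with exact rational arithmetic. The following is a Python transcription of the GP function for a rational argument z, given as a sketch; the use of `fractions.Fraction` and the evaluation at a number instead of a symbolic variable are choices made for this illustration.

```python
from fractions import Fraction

def log_pade(n, z):
    """Value of the Pade approximant [n,n] of log(1+z) at a rational z."""
    if n < 1:
        return Fraction(0)
    z = Fraction(z)
    p0, q0 = Fraction(0), Fraction(1)      # P_0, Q_0
    p2, q2 = z, 1 + z / 2                  # P_2, Q_2
    for k in range(4, 2 * n + 1, 2):
        ak = 1 + z / 2                     # A_k
        t = Fraction((k - 2) ** 2)
        bk = z * z / 16 * t / (1 - t)      # B_k
        p0, p2 = p2, ak * p2 + bk * p0     # P_k = A_k P_{k-2} + B_k P_{k-4}
        q0, q2 = q2, ak * q2 + bk * q0     # Q_k = A_k Q_{k-2} + B_k Q_{k-4}
    return p2 / q2
```

For example, `log_pade(3, 1)` returns 131/189, the value of (60z + 60z^2 + 11z^3)/(60 + 90z + 36z^2 + 3z^3) at z = 1, which is about 0.69312 and close to log(2).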
32.1.2.3 Padé approximants for arctan 1 
A continued fraction for the inverse tangent is (given in p.569])

    \arctan(z) = \cfrac{z}{1 + \cfrac{\frac{1^2 z^2}{4 \cdot 1^2 - 1}}{1 + \cfrac{\frac{2^2 z^2}{4 \cdot 2^2 - 1}}{1 + \cfrac{\frac{3^2 z^2}{4 \cdot 3^2 - 1}}{1 + \cdots}}}}    (32.1-13)

The Padé approximants P_k/Q_k for arctan are computed by setting P_0 = 0, P_1 = z, Q_0 = 1, Q_1 = 1, and
the recurrences

    P_{k+1} = P_k + P_{k-1} \frac{k^2 z^2}{4 k^2 - 1}    (32.1-14a)
    Q_{k+1} = Q_k + Q_{k-1} \frac{k^2 z^2}{4 k^2 - 1}    (32.1-14b)

The first few approximants are shown in figure 32.1-B.

    1 \mapsto P_{[1,0]} = z
    2 \mapsto P_{[1,2]} = \frac{3 z}{3 + z^2}
    3 \mapsto P_{[3,2]} = \frac{15 z + 4 z^3}{15 + 9 z^2}
    4 \mapsto P_{[3,4]} = \frac{105 z + 55 z^3}{105 + 90 z^2 + 9 z^4}
    5 \mapsto P_{[5,4]} = \frac{945 z + 735 z^3 + 64 z^5}{945 + 1050 z^2 + 225 z^4}
    6 \mapsto P_{[5,6]} = \frac{1155 z + 1190 z^3 + 231 z^5}{1155 + 1575 z^2 + 525 z^4 + 25 z^6}
    k \mapsto P_{[i,j]} = \arctan(z) + O(z^{2k+1})


Figure 32.1-B: Padé approximants for arctan(z).


32.1.3 Argument reduction for the logarithm 


If the logarithm is computed for moderate precision (up to several hundred decimal digits or so), the
following scheme can beat the AGM algorithm. Use the functional equation for the logarithm

    \log(z^a) = a \log(z)    (32.1-15)

to reduce the argument by setting a = 1/N. Now with N big enough z^{1/N} will be close to 1: r := z^{1/N} =
1 + e where e is small. Then a few terms of the power series of log(1 + e) = e − e^2/2 + e^3/3 − ... suffice
to compute the logarithm. Compute the logarithm of z as follows

1. Set r := z^{1/N} and e := r − 1.
2. Compute l := log(1 + e) to the desired precision using the power series.
3. Return L := N l.

We can also use a Padé approximant in step 2. With argument z = 2.0, N = 2^32, and four terms of the
series we obtain:

? z=2.0; \\ argument for log()
? n=32; N=2^n;
? r=z^(1/N) \\ compute by 32 sqrt extractions
1.000000000161385904209659761203976631101985032744612016
? e=r-1 \\ small
1.613859042096597612039766311019850327446120165053265785 E-10
? l=e-1/2*e^2+1/3*e^3-1/4*e^4 \\ approx log(1+e)
1.613859041966370561665930136708022486594054133693140550 E-10
? L=N*l \\ final result
0.6931471805599453094172321214581765680754060932265650365
? log(z) \\ check with built-in log
0.6931471805599453094172321214581765680755001343602552541
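The three steps can be repeated outside GP. The following Python sketch uses the standard `decimal` module; the function name `log_reduced`, its defaults, and the number of guard digits are assumptions made here.

```python
from decimal import Decimal, getcontext

def log_reduced(z, n=32, terms=4, digits=50):
    """log(z) via r = z^(1/2^n), a short series for log(1+e), and scaling by 2^n."""
    getcontext().prec = digits + 10        # guard digits (assumption)
    r = Decimal(z)
    for _ in range(n):                     # step 1: n square root extractions
        r = r.sqrt()
    e = r - 1
    # step 2: e - e^2/2 + e^3/3 - ... truncated after `terms` terms
    l = sum((-1) ** (j + 1) * e ** j / j for j in range(1, terms + 1))
    return 2 ** n * l                      # step 3: L = N l with N = 2^n
```

With the defaults the first neglected term is e^5/5 with e near 1.6 · 10^-10, which limits the result to roughly 39 correct digits, as in the transcript above.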


We may also use the following reduction for L(z) := log(1 + z), which avoids loss of precision for small
values of z:

    L(z) = 2 L\left( \frac{z}{1 + \sqrt{1 + z}} \right)    (32.1-16)


32.1.4 Argument reduction for arctan 


We use the equation

    \arctan(z) = 2 \arctan\left( \frac{z}{1 + \sqrt{1 + z^2}} \right)    (32.1-17)

Compute the inverse tangent of z as follows

1. set r := z
2. repeat n times: r = r/(1 + sqrt(1 + r^2)) (for n big enough)
3. compute a := arctan(r) to the desired precision using the power series
4. return A := 2^n a

We compute arctan(1.0) using n = 16 and four terms of the series:

? z=1.0;
? n=16;
? r=z; for(k=1,n,r=r/(1+sqrt(1+r^2)));r
0.00001198422490593030347851163479465066131958512874402526189
? a=r-1/3*r^3+1/5*r^5-1/7*r^7
0.00001198422490535657210717255929290581849745567656967660263
? A=2^n*a
0.7853981633974483096156608458198757210492552196703258299
? atan(z) \\ check with built-in
0.7853981633974483096156608458198757210492923498437764552

All divisions in the reduction phase can be saved by using

    \arctan(1/z) = 2 \arctan\left( \frac{1}{z + \sqrt{z^2 + 1}} \right)    (32.1-18)
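Both forms of the reduction can be compared in a short experiment. The Python sketch below uses the standard `decimal` module; the function names and defaults are assumptions for this illustration. The second function works with R = 1/r, so that the reduction phase needs only additions and square roots and a single division at the end.

```python
from decimal import Decimal, getcontext

def _atan_series(r, terms):
    # r - r^3/3 + r^5/5 - ... truncated after `terms` terms
    return sum((-1) ** j * r ** (2 * j + 1) / (2 * j + 1) for j in range(terms))

def atan_reduced(z, n=16, terms=4, digits=50):
    """arctan(z): n halvings of the angle via relation 32.1-17, then the series."""
    getcontext().prec = digits + 10
    r = Decimal(z)
    for _ in range(n):
        r = r / (1 + (1 + r * r).sqrt())
    return 2 ** n * _atan_series(r, terms)

def atan_reduced_nodiv(z, n=16, terms=4, digits=50):
    """Same for z > 0, iterating on R = 1/r as suggested by relation 32.1-18."""
    getcontext().prec = digits + 10
    R = 1 / Decimal(z)
    for _ in range(n):
        R = R + (R * R + 1).sqrt()         # no division inside the loop
    return 2 ** n * _atan_series(1 / R, terms)
```

Both functions return the same value up to rounding; with n = 16 and four terms about 40 digits of π/4 are obtained for z = 1.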
The inverse sine and cosine can be computed as

    \arcsin(z) = \arctan\left( \frac{z}{\sqrt{1 - z^2}} \right)    (32.1-19a)
    \arccos(z) = \arctan\left( \frac{\sqrt{1 - z^2}}{z} \right)    (32.1-19b)
32.1.5 Curious series for the logarithm t1 
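We note two relations resembling the well-known series

    \frac{1}{2} \log\left( \frac{1 + x}{1 - x} \right) = \sum_{k=0}^{\infty} \frac{x^{2k+1}}{2k+1}    (32.1-20)

The first is

    \frac{1}{6} \log\left( \frac{1 + 3x + 3x^2}{1 - 3x + 3x^2} \right) = x - \frac{3^2 x^5}{5} - \frac{3^3 x^7}{7} + \frac{3^5 x^{11}}{11} + \frac{3^6 x^{13}}{13} - - + + \ldots    (32.1-21a)
        = \sum_{k=0}^{\infty} \left( \frac{3^{6k} x^{12k+1}}{12k+1} - \frac{3^{6k+2} x^{12k+5}}{12k+5} - \frac{3^{6k+3} x^{12k+7}}{12k+7} + \frac{3^{6k+5} x^{12k+11}}{12k+11} \right)    (32.1-21b)

The second, given in [31], is

    \frac{1}{2} \log\left( \frac{1 + x + x^2}{1 - x + x^2} \right) = \sum_{k=0}^{\infty} \left( \frac{x^{6k+1}}{6k+1} - \frac{2 x^{6k+3}}{6k+3} + \frac{x^{6k+5}}{6k+5} \right)    (32.1-22)

Relation 32.1-21a can be brought into a similar form:

    \frac{1}{2 \sqrt{3}} \log\left( \frac{1 + \sqrt{3} x + x^2}{1 - \sqrt{3} x + x^2} \right) = x - \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^{11}}{11} + \frac{x^{13}}{13} - - + + \ldots    (32.1-23a)
        = \sum_{k=0}^{\infty} \left( \frac{x^{12k+1}}{12k+1} - \frac{x^{12k+5}}{12k+5} - \frac{x^{12k+7}}{12k+7} + \frac{x^{12k+11}}{12k+11} \right)    (32.1-23b)

Let F_k be the Fibonacci numbers and P_k the Pell numbers (respectively entries A000045 and A000129
in the OEIS), then

    \frac{1}{2 \sqrt{5}} \log\left( \frac{1 + \sqrt{5} x + x^2}{1 - \sqrt{5} x + x^2} \right) = \sum_{k=0}^{\infty} \frac{F_{2k+1} x^{2k+1}}{2k+1} = x + \frac{2 x^3}{3} + \frac{5 x^5}{5} + \ldots    (32.1-24a)
    \frac{1}{4 \sqrt{2}} \log\left( \frac{1 + 2 \sqrt{2} x + x^2}{1 - 2 \sqrt{2} x + x^2} \right) = \sum_{k=0}^{\infty} \frac{P_{2k+1} x^{2k+1}}{2k+1} = x + \frac{5 x^3}{3} + \frac{29 x^5}{5} + \ldots    (32.1-24b)

The relations are special cases of the following identity. Fix u and let V_0 = 0, V_1 = 1, and V_k =
u V_{k-1} + V_{k-2}, set a = u^2 + 4, then

    \frac{1}{2 \sqrt{a}} \log\left( \frac{1 + \sqrt{a} x + x^2}{1 - \sqrt{a} x + x^2} \right) = \sum_{k=0}^{\infty} \frac{V_{2k+1} x^{2k+1}}{2k+1}    (32.1-24c)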
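A quick numerical check of the general identity is possible in double precision. The Python sketch below reads the identity as a sum over the odd-indexed terms V_k x^k/k with a = u^2 + 4; that reading, the function names, and the truncation point are assumptions of this illustration. u = 1 gives the Fibonacci case, u = 2 the Pell case.

```python
import math

def closed_form(u, x):
    """(1/(2 sqrt(a))) log((1 + sqrt(a) x + x^2)/(1 - sqrt(a) x + x^2)), a = u^2 + 4."""
    s = math.sqrt(u * u + 4)
    return math.log((1 + s * x + x * x) / (1 - s * x + x * x)) / (2 * s)

def odd_series(u, x, kmax=199):
    """Sum of V_k x^k / k over odd k <= kmax, with V_0 = 0, V_1 = 1, V_k = u V_{k-1} + V_{k-2}."""
    v_prev, v = 0, 1                       # V_0, V_1 (exact integers)
    total = 0.0
    for k in range(1, kmax + 1):
        if k % 2 == 1:
            total += v * x ** k / k
        v_prev, v = v, u * v + v_prev      # advance to V_{k+1}
    return total
```

The series converges only while |x| is below the reciprocal of the larger root of t^2 − u t − 1, so small arguments such as x = 0.1 are used for the comparison.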


32.2 Exponential function 


32.2.1 AGM-based computation of the exponential function 


We use q = exp(−π K'/K) (relation 31.3-5 on page 605) and write

    \frac{K'}{K} = \frac{AGM(1, k')}{AGM(1, k)} = \frac{AGM(1, b_0)}{AGM(1, b_0')}    (32.2-1)

where k' = b_0 and k = b_0' = \sqrt{1 - b_0^2} and use [206, p.38]

    \frac{\pi}{2} \frac{AGM(1, b_0)}{AGM(1, b_0')} = \lim_{n \to \infty} \frac{1}{2^n} \log \frac{4 a_n}{c_n}    (32.2-2)

thereby

    q = \exp\left( -2 \lim_{n \to \infty} \frac{1}{2^n} \log \frac{4 a_n}{c_n} \right) = \lim_{n \to \infty} \exp\left( -\frac{1}{2^{n-1}} \log \frac{4 a_n}{c_n} \right)    (32.2-3a)
      = \lim_{n \to \infty} \left( \exp \log \frac{4 a_n}{c_n} \right)^{-1/2^{n-1}} = \lim_{n \to \infty} \left( \frac{4 a_n}{c_n} \right)^{-1/2^{n-1}}    (32.2-3b)

This gives

    q = \lim_{n \to \infty} \left( \frac{c_n}{4 a_n} \right)^{1/2^{n-1}}    (32.2-4)

One obtains an algorithm for exp(−x) by first solving for k, k' such that x = π K'/K (precomputed π)
and applying the last relation that implies the computation of a 2^{n-1}-th root. Note that c_{n+1} should be
computed as c_{n+1} = c_n^2/(4 a_{n+1}).

For k = 1/\sqrt{2} =: s we have k = k' and so q = exp(−π). Thus the calculation of exp(−π) =
0.0432139182637 ... can directly be done via a single AGM computation as (c_n/(4 a_n))^N where N =
1/2^{n-1}. The quantity i^i = exp(−π/2) = 0.2078795763507 ... can be computed using N = 1/2^n.
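The special case k = k' = 1/sqrt(2) makes a compact test of relation 32.2-4. The Python sketch below uses the standard `decimal` module; the function name and the defaults are assumptions. It takes a_0 = 1, b_0 = c_0 = 1/sqrt(2), updates c via c_{n+1} = c_n^2/(4 a_{n+1}), and extracts the 2^(n-1)-th root by repeated square roots.

```python
from decimal import Decimal, getcontext

def exp_minus_pi(n=6, digits=40):
    """exp(-pi) as (c_n / (4 a_n))^(1/2^(n-1)) from one AGM computation."""
    getcontext().prec = digits + 20            # generous guard digits (assumption)
    a = Decimal(1)
    b = 1 / Decimal(2).sqrt()                  # b_0 = k'
    c = 1 / Decimal(2).sqrt()                  # c_0 = k
    for _ in range(n):
        a_next = (a + b) / 2
        b = (a * b).sqrt()
        c = c * c / (4 * a_next)               # c_{n+1} = c_n^2 / (4 a_{n+1})
        a = a_next
    q = c / (4 * a)
    for _ in range(n - 1):                     # 2^(n-1)-th root
        q = q.sqrt()
    return q
```

Computing c by this product form instead of the difference (a_n − b_n)/2 keeps its relative accuracy although c_n becomes extremely small; n = 2 already gives 0.04321 and n = 6 is accurate beyond 40 digits.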
32.2.2 Computation by inverting the logarithm 


32.2.2.1 Iterations from the power series 


The exponential function can be computed using the n-th order iteration

    x_{k+1} = \Phi_n(x_k) := x_k \left( 1 + y + \frac{y^2}{2} + \frac{y^3}{3!} + \ldots + \frac{y^{n-1}}{(n-1)!} \right)  where y := d - \log(x_k)    (32.2-5)

The iteration can be derived as follows:

    \exp(d) = x \exp(d - \log(x)) = x \exp(y)  where y := d - \log(x)    (32.2-6a)
            = x \left( 1 + y + \frac{y^2}{2} + \frac{y^3}{3!} + \ldots \right)    (32.2-6b)

As the computation of logarithms is expensive one should use an iteration of high order. The C++
implementation given in [hfloat: src/tz/itexp.cc] uses the iteration of order 20.


32.2.2.2 Iterations from Padé approximants 


    P_{[1,1]} = \frac{2 + z}{2 - z}
    P_{[2,2]} = \frac{12 + 6 z + z^2}{12 - 6 z + z^2}
    P_{[3,3]} = \frac{120 + 60 z + 12 z^2 + z^3}{120 - 60 z + 12 z^2 - z^3}
    P_{[4,4]} = \frac{1680 + 840 z + 180 z^2 + 20 z^3 + z^4}{1680 - 840 z + 180 z^2 - 20 z^3 + z^4}


Figure 32.2-A: Diagonal Padé approximants for the exponential function.


The Padé approximants P_{[i,j]}(z) of exp(z) give iterations of order i + j + 1. The first few approximants
for i = j are shown in figure 32.2-A. The functional equation exp(−z) = 1/exp(z) holds for the diagonal
approximants. In general, we have P_{[i,j]}(−z) = 1/P_{[j,i]}(z). This can be seen from the following closed
form

    P_{[i,j]}(z) = \left( \sum_{k=0}^{i} \frac{\binom{i}{k}}{\binom{i+j}{k}} \frac{z^k}{k!} \right) / \left( \sum_{k=0}^{j} \frac{\binom{j}{k}}{\binom{i+j}{k}} \frac{(-z)^k}{k!} \right)    (32.2-7)

The numerator for i = j, multiplied by (2i)!/i! to avoid rational coefficients, equals

    \frac{(2i)!}{i!} \sum_{k=0}^{i} \frac{\binom{i}{k}}{\binom{2i}{k}} \frac{z^k}{k!}    (32.2-8)

The coefficients of the numerator and denominator in the diagonal approximant

    P_{[i,i]} = \frac{\sum_{k=0}^{i} c_k z^k}{\sum_{k=0}^{i} c_k (-z)^k}    (32.2-9)

can be computed using c_i = 1 (the coefficient of the highest power of z) and the recurrence

    c_k = c_{k+1} \frac{(k+1) (2i - k)}{i - k}    (32.2-10)

It is usually preferable to generate the coefficients in the other direction. Compute the constant c_0

    c_0 = \prod_{w=1}^{i} (4 w - 2) = 2, 12, 120, 1680, 30240, ...    (32.2-11)

and use the recurrence

    c_{k+1} = c_k \frac{i - k}{(2i - k) (k + 1)}    (32.2-12)

We generate the coefficients for 1 ≤ i ≤ 8:

? c0(i)=prod(w=1,i,4*w-2)
? qq(i,k)=(i-k)/((2*i-k)*(k+1))
? for (i=1, 8, c=c0(i); print1("[",i,",",i,"]: "); \
      for (k=0, i, print1(" ", c); c*=qq(i,k)); print();)
[1,1]: 2 1
[2,2]: 12 6 1
[3,3]: 120 60 12 1
[4,4]: 1680 840 180 20 1
[5,5]: 30240 15120 3360 420 30 1
[6,6]: 665280 332640 75600 10080 840 42 1
[7,7]: 17297280 8648640 1995840 277200 25200 1512 56 1
[8,8]: 518918400 259459200 60540480 8648640 831600 55440 2520 72 1

Finally, the approximant P_{[i,j]} can be expressed as ratio of hypergeometric series:

    P_{[i,j]} = F\left( -i; -i-j \,\middle|\, z \right) / F\left( -j; -i-j \,\middle|\, -z \right)    (32.2-13)

This is relation 36.2-33 on page 693 with a = −i and b = −j where i and j are positive integers.


32.2.3 Argument reduction for the exponential function and cosine 


As for the logarithm an argument reduction technique can be useful with moderate precisions. We do 
not use the functional equation for the exponential function (exp(2z) = exp(z)?) because of the loss of 
precision when adding up the terms of the power series (1 plus a tiny quantity). Instead we use the 
functional equation for E(z) := exp(z) — 1: 


    E(2z) = 2 E(z) + E^2(z)    (32.2-14)

Compute the exponential function of z as follows
1. Set r = z/2^n (for n big enough).
2. Compute E := exp(r) − 1 to the desired precision using the power series.
3. Repeat n times: E = 2 E + E^2.
4. Return E + 1.


We compute exp(1.0) using n = 16 and eight terms of the series:

? z=1.0;
? n=16;
? r=z/2^n
0.00001525878906250000000000000000000000000000000000000000000
? E=r*(1+r/2*(1+r/3*(1+r/4*(1+r/5*(1+r/6*(1+r/7*(1+r/8)))))))
0.00001525890547841394814004262248066173018701234845511622583
? for (k=1,n,E=2*E+E^2); E=E+1
2.718281828459045235360287471352662497757247071686614582
? exp(1.0) \\ check with built-in exp
2.718281828459045235360287471352662497757247093699959575
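The four steps translate directly into other multiprecision environments. The following Python sketch uses the standard `decimal` module; the function name, defaults and guard digits are assumptions. The series for E = exp(r) − 1 is evaluated in the nested (Horner) form used in the transcript.

```python
from decimal import Decimal, getcontext

def exp_reduced(z, n=16, terms=8, digits=50):
    """exp(z) via E(r) = exp(r) - 1 for r = z/2^n and n steps E = 2E + E^2."""
    getcontext().prec = digits + 10
    r = Decimal(z) / 2 ** n                    # step 1
    e = Decimal(1)
    for j in range(terms, 1, -1):              # builds 1 + r/2*(1 + r/3*(... (1 + r/terms)))
        e = 1 + r / j * e
    e = r * e                                  # step 2: E = exp(r) - 1 (truncated series)
    for _ in range(n):                         # step 3
        e = 2 * e + e * e
    return e + 1                               # step 4
```

Working with E instead of exp(r) avoids adding a tiny quantity to 1 in every step, which is the loss of precision mentioned above.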


We can also compute the exponential function via the hyperbolic cosine or sine:

    \exp(z) = \cosh(z) + \sinh(z) = \cosh(z) + \sqrt{\cosh^2(z) - 1}    (32.2-15a)
            = \sinh(z) + \sqrt{\sinh(z)^2 + 1}    (32.2-15b)

The advantage is that half of the coefficients of the power series are zero. Again we do not use the
functional equation for the hyperbolic cosine (cosh(2z) = 2 cosh^2(z) − 1) but that for C(z) := cosh(z) − 1:

    C(2z) = 2 (C(z) + 1)^2 - 2 = 2 C(z)^2 + 4 C(z)    (32.2-16)

Compute the hyperbolic cosine as follows
1. set r = z/2^n (for n big enough)
2. compute C := cosh(r) − 1 to the desired precision using the power series
3. repeat n times: C = 2 [C + 1]^2 − 2
4. return C + 1

We compute cosh(1.5) using n = 16 and four terms of the series:

? z=1.50;
? n=16;
? r=z/2^n
0.00002288818359375000000000000000000000000000000000000000000
? C=1/2*r^2+1/24*r^4+1/720*r^6+1/40320*r^8
0.0000000002619344741220382773076639849940582903433100542426455923
? for(k=1,n,C=2*(C+1)^2-2);C
1.352409615243247325767667965441644170173960682574839216
? C+1
2.352409615243247325767667965441644170173960682574839216
? cosh(z) \\ check with built-in cosh
2.352409615243247325767667965441644170173960748865373192

If the series for cos(z) − 1 is used, then the cosine can be computed by the identical algorithm:

? z=1.50;
? n=16;
? r=z/2^n
0.00002288818359375000000000000000000000000000000000000000000
? C=-1/2*r^2+1/24*r^4-1/720*r^6+1/40320*r^8
-0.0000000002619344740991683877317978758392098468831796333435288181
? for(k=1,n,C=2*(C+1)^2-2);C
-0.9292627983322970899118101485657312909149089413817931623
? C+1
0.0707372016677029100881898514342687090850910586182068377
? cos(z) \\ check with built-in
0.07073720166770291008818985143426870908509102756334686942

Compute the sine as sin(z) = \sqrt{1 - \cos^2(z)} = \sqrt{-2 C - C^2} and the tangent as tan(z) = sin(z)/cos(z).


32.3 Logarithm and exponential function of power series 


The computation of the logarithm, the exponential function, and the inverse trigonometric functions 
turns out to be surprisingly simple with power series. 


32.3.1 Logarithm 


Let f(x) be a power series in x and g(x) = log(f(x)). Then we have dg(x)/dx = f'(x)/f(x) and

    g(x) = \log(f(x)) = \int \frac{f'(x)}{f(x)} dx    (32.3-1)

A few lines of GP demonstrate this:

? sp=8;default(seriesprecision,sp+1);
? f=taylor((1)/(1-x-x^2),x) /* shifted Fibonacci (with constant term) */
1 + x + 2*x^2 + 3*x^3 + 5*x^4 + 8*x^5 + 13*x^6 + 21*x^7 + 34*x^8 + O(x^9)
? d=deriv(f,x)
1 + 4*x + 9*x^2 + 20*x^3 + 40*x^4 + 78*x^5 + 147*x^6 + 272*x^7 + O(x^8)
? q=d/f /* the only nontrivial computation */
1 + 3*x + 4*x^2 + 7*x^3 + 11*x^4 + 18*x^5 + 29*x^6 + 47*x^7 + O(x^8)
? lf=intformal(q)
x + 3/2*x^2 + 4/3*x^3 + 7/4*x^4 + 11/5*x^5 + 3*x^6 + 29/7*x^7 + 47/8*x^8 + O(x^9)
? f-exp(lf) /* check with built-in exp() */
O(x^9)
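The derivative, division and formal integration used here take only a few lines when a truncated power series is stored as a list of exact coefficients. The following Python sketch uses `fractions.Fraction`; the helper names `series_div` and `series_log` are assumptions of this illustration.

```python
from fractions import Fraction

def series_div(num, den, n):
    """First n coefficients of num/den for coefficient lists with den[0] != 0."""
    q = []
    for k in range(n):
        s = Fraction(num[k]) if k < len(num) else Fraction(0)
        s -= sum(q[j] * den[k - j] for j in range(k) if k - j < len(den))
        q.append(s / den[0])
    return q

def series_log(f, n):
    """First n coefficients of log(f) for a power series f with f[0] == 1."""
    d = [(k + 1) * f[k + 1] for k in range(len(f) - 1)]              # f'
    q = series_div(d, f, n - 1)                                      # f'/f
    return [Fraction(0)] + [q[k] / (k + 1) for k in range(n - 1)]    # formal integral
```

For the shifted Fibonacci series the quotient f'/f has the coefficients 1, 3, 4, 7, 11, 18, 29, 47, and the integral reproduces the coefficients of lf shown in the transcript.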


32.3.2 Inverse trigonometric functions 


Now let a(x) = arctan(f(x)). Then, symbolically,

    a(x) = \int \frac{f'(x)}{1 + f(x)^2} dx    (32.3-2)

Verification for the trivial case f(x) = x:

? sp=13;default(seriesprecision,sp+1);
? f=taylor(x,x)
x + O(x^14)
? d=deriv(f,x)
1 + O(x^13)
? q=d/(1+f^2)
1 - x^2 + x^4 - x^6 + x^8 - x^10 + x^12 + O(x^13)
? af=intformal(q,x)
x - 1/3*x^3 + 1/5*x^5 - 1/7*x^7 + 1/9*x^9 - 1/11*x^11 + 1/13*x^13 + O(x^14)
? f-tan(af) /* check with built-in tan() */
O(x^14)

For s(x) = arcsin(f(x)) use

    s(x) = \int \frac{f'(x)}{\sqrt{1 - f(x)^2}} dx    (32.3-3)


32.3.3 Exponential function 
With e(x) = exp(f(x)) we can use a scheme similar to those shown in section 29.7 on page 583. We
express a function g(y) as

    g(y) = \prod_{k=1}^{\infty} [1 + T(Y_k)]    (32.3-4)

where Y_1 = y, Y_{k+1} = N(Y_k) and 1 + T(y) is the truncated power series of g. A second order product is
obtained by taking 1 + T(y) = 1 + y (the series of exp(y) truncated before the second term) and

    N(y) = g^{-1}\left( \frac{g(y)}{1 + y} \right)    (32.3-5)

For g(y) = exp(y) we have N(y) = y − log(1 + y) and

    \exp(y) = \prod_{k=1}^{\infty} [1 + Y_k]    (32.3-6)

where Y_1 = y = f(x) and Y_{k+1} = Y_k − log(1 + Y_k). The product of the first N factors is correct up to order
y^{2^N - 1}. The computation involves N − 2 logarithms and N − 1 multiplications. Implementation in GP:

texp(y, N=5)=
{
    local(Y, e, t);
    Y=y;  e=1+Y;
    for (k=2, N,
        t = deriv(1+Y,x)/(1+Y);
        t = intformal(t);  \\ here: t = log(1+Y);
        Y -= t;
        e *= (1+Y);
    );
    return( e );
}

Check:

? f=taylor((x)/(1-x-x^2),x)
x + x^2 + 2*x^3 + 3*x^4 + 5*x^5 + 8*x^6 + 13*x^7 + 21*x^8 + ...
? e=exp(f) /* built-in exp() */
1 + x + 3/2*x^2 + 19/6*x^3 + 145/24*x^4 + 467/40*x^5 + 16051/720*x^6 + ...
? t=texp(f,4);
? t-e
-1/32768*x^16 - 35/98304*x^17 - ...

The a-th power of a power series S can be computed as

    S^a = \exp[a \log(S)]    (32.3-7)

32.4 Simultaneous computation of logarithms of small primes


We describe a method to compute the logarithms of a given set of (small) primes simultaneously. We
define

    L(z) := 2 arccoth(z) = 2 \sum_{k=0}^{\infty} \frac{1}{(2k+1) z^{2k+1}}    (32.4-1)

and note that (relation 36.3-23d on page 699)

    \log(z) = 2 arccoth\left( \frac{z + 1}{z - 1} \right)    (32.4-2)

We will determine a set of relations that express the logarithm of a prime as linear combination of terms
L(X_i) where the X_i are large integers so that the series for L converges quickly.

S = { 51744295, 170918749, 265326335, 287080366, 362074049, 587270881,
      617831551, 740512499, 831409151, 1752438401, 2151548801, 2470954914, 3222617399 }

 2: [ -1595639, -17569128, -8662593, -31112926, -13108464, -11209640,
      -12907342, +9745611, -1705229, -12058985, +4580610, +4775383, -12972664 ]
 3: [ -2529028, -27846409, -13729885, -49312821, -20776424, -17766859,
      -20457653, +15446428, -2702724, -19113039, +7260095, +7568803, -20561186 ]
 5: [ -3704959, -40794252, -20113918, -72241977, -30436911, -26027978,
      -29969920, +22628608, -3959419, -28000096, +10635847, +11088096, -30121593 ]
 7: [ -4479525, -49322778, -24318973, -87345026, -36800111, -31469438,
      -36235490, +27359389, -4787183, -33853851, +12859398, +13406195, -36418872 ]
11: [ -5520004, -60779197, -29967648, -107633040, -45347835, -38778983,
      -44652067, +33714275, -5899123, -41717234, +15846307, +16520111, -44878044 ]
13: [ -5904566, -65013499, -32055403, -115131507, -48507081, -41480597,
      -47762841, +36063046, -6310097, -44623547, +16950271, +17671017, -48004561 ]
17: [ -6522115, -71813158, -35408027, -127172929, -53580360, -45818987,
      -52758281, +39834823, -6970060, -49290653, +18723073, +19519201, -53025282 ]
19: [ -6778159, -74632382, -36798067, -132165454, -55683805, -47617738,
      -54829453, +41398649, -7243689, -51225694, +19458099, +20285481, -55106936 ]
23: [ -7217972, -79475039, -39185776, -140741248, -59296949, -50707501,
      -58387161, +44084875, -7713709, -54549566, +20720673, +21601741, -58682649 ]
29: [ -7751584, -85350490, -42082712, -151146003, -63680669, -54456218,
      -62703622, +47343993, -8283970, -58582320, +22252516, +23198720, -63020955 ]
31: [ -7905109, -87040909, -42916186, -154139543, -64941904, -55534757,
      -63945506, +48281670, -8448039, -59742579, +22693241, +23658185, -64269124 ]
37: [ -8312407, -91525553, -45127374, -162081337, -68287932, -58396097,
      -67240196, +50769306, -8883311, -62820720, +23862474, +24877135, -67580488 ]
41: [ -8548719, -94127517, -46410292, -166689119, -70229278, -60056229,
      -69151756, +52212618, -9135853, -64606639, +24540856, +25584363, -69501722 ]


Figure 32.4-A: Relations for the fast computation of the logarithms of the primes up to 41.


Compute $\log(p_i)$ for the primes $p_i$ in a predefined set $P$ of $n$ primes as follows:
1. Find a set $S$ of numbers $X \in \mathbb{Z}$ so that $X^2 - 1$ factors completely into the primes in $P$.
2. Select a subset of $n$ (large) numbers $X_j$ so that all $L(X_j)$ are linearly independent.
3. Try to find, for each prime $p_i$, a relation $\log(p_i) = \sum_{j=1}^{n} m_j L(X_j)$. If this fails return to step 2.
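
Step 1 can be illustrated with a small smoothness test. The following is only a sketch, not the program used for the figures: the names `is_smooth` and `is_candidate` are made up for this example, and trial division is assumed to be adequate because the primes in $P$ are small. It uses $X^2 - 1 = (X-1)(X+1)$ so that the product never has to be formed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return true if v (v >= 1) has no prime factor outside 'primes'.
// (Hypothetical helper; plain trial division.)
bool is_smooth(std::uint64_t v, const std::vector<std::uint64_t>& primes)
{
    assert(v >= 1);
    for (std::uint64_t p : primes)
        while (v % p == 0) v /= p;
    return v == 1;
}

// X^2 - 1 = (X-1)*(X+1): test both factors separately,
// which avoids forming the (possibly large) product.
bool is_candidate(std::uint64_t X, const std::vector<std::uint64_t>& primes)
{
    assert(X >= 2);
    return is_smooth(X - 1, primes) && is_smooth(X + 1, primes);
}
```

With the first 13 primes, `is_candidate(51744295, P)` holds because 51744294 = 2 * 3^2 * 7^3 * 17^2 * 29 and 51744296 = 2^3 * 19^3 * 23 * 41.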
For example, with the first 13 primes ($P = \{2, 3, 5, 7, 11, \ldots, 41\}$) we find

$$ \{X_1, X_2, \ldots, X_{13}\} = $$ (32.4-3)

{51744295, 170918749, 265326335, 287080366, 362074049, 587270881,
617831551, 740512499, 831409151, 1752438401, 2151548801, 2470954914, 3222617399}

We use the short form p: [m1, m2, m3, ..., m13] to denote a relation

$$ \log(p) = \sum_{j=1}^{13} m_j L(X_j) $$ (32.4-4)


Now we have the relations given in figure 32.4-A; the first is

$$ \log(2) = -1595639\, L(51{,}744{,}295) - 17569128\, L(170{,}918{,}749) \pm \ldots - 12972664\, L(3{,}222{,}617{,}399) $$

The series with slowest convergence (with argument $X_1 = 51{,}744{,}295$) already gives more than 15 digits
per term: we have $\log_{10}(X_1^2) \approx 15.4$. The last series gives 19 digits per term.


2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41 


51744295: [ -2, +2, 0, +3, 0, 0, +2, -3, -1, +1, 0, 0, -1] 
170918749: [ +1, +4, -5, +1, +1, +1, +1, 0, -1, -1, +1, 0, -1] 
265326335: [ -7, -2, 0, +1, -1, +1, 0, -2, 0, -1, +2, +1, +1 ] 
287080366: [ 0, +1, +1, -5, +2, +1, 0, -1, +3, -1, -1, 0, 0] 
362074049: [ +5, -4, -2, +1, 0, -2, 0, 0, -2, +2, +2, 0, 0] 
587270881: [ +4, +2, +1, +3, -1, 0, -1, 0, 0, +1, -1, -3, +1 ] 
617831551: [ -6, +4, +2, +1, 0, -6, 0, +1, 0, 0, +1, +1, 0 ]
740512499: [ -1, -1, -5, -2, +7, -1, 0, +1, 0, 0, -1, 0, 0] 
831409151: [ -9, -1, +2, -1, +3, +1, 0, 0, -1, 0, +2, 0, -2 ] 

1752438401: [ +6, -4, +2, 0, -2, -2, 0, +2, -2, 0, 0, +1, +1 ] 
2151548801: [ +6, -2, +2, 0, 0, -2, 0, 0, +2, -4, +1, 0, +1] 
2470954914: [ 0, 0, -1, +1, -3, -5, +2, 0, 0, 0, +3, 0, +1] 
3222617399: [ -2, -6, -2, +4, +1, +2, 0, +2, -1, 0, -2, 0, 0] 


Figure 32.4-B: Values L(x) as linear combinations of logarithms of small primes. 


Figure 32.4-B shows the linear combinations of logarithms of small primes that give the values L(x). The
first row is the relation

$$ L(51{,}744{,}295) = -2 \log(2) + 2 \log(3) + 3 \log(7) \pm \ldots - 1 \log(41) $$ (32.4-5)

The shown values, as a matrix, are the inverse of the values in figure 32.4-A.


Precomputed logarithms of small primes can be used for the computation of the logarithms of integers
k if one can determine a smooth number near k. For example, the logarithm of 65537 (a prime) can be
computed as

$$ \log(65537) = \log\left(\frac{65537}{65536} \cdot 65536\right) = \log\left(\frac{65537}{65536}\right) + \log(65536) $$ (32.4-6a)
$$ = \log\left(1 + \frac{1}{65536}\right) + 16 \log(2) $$ (32.4-6b)

The series of the first logarithm converges fast and log(2) is precomputed. Jim White suggested this
approach [priv. comm.]. If k is not near a smooth number but $u \cdot k$ is, where u factors into the
chosen prime set, use the relation

$$ \log(k) = \log(u k) - \log(u) $$ (32.4-7)

Here log(u) is the sum of precomputed logarithms and with log(u k) we proceed as above.
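
The example for 65537 can be tried in double precision. This is a minimal sketch under stated assumptions: `log1p_series` and `log_65537` are names invented here, the Taylor series of log(1+y) stands in for whatever series one would actually use, and log(2) is passed in as the precomputed value.

```cpp
#include <cassert>
#include <cmath>

// log(1+y) for small |y| by the series y - y^2/2 + y^3/3 - ...
// (Hypothetical helper; 8 terms are far more than needed for y = 2^-16.)
double log1p_series(double y, int terms = 8)
{
    assert(std::fabs(y) < 0.5);
    double sum = 0.0, pw = y;
    for (int j = 1; j <= terms; ++j)
    {
        sum += ((j & 1) ? pw : -pw) / j;  // alternating signs
        pw *= y;
    }
    return sum;
}

// log(65537) = log(1 + 1/65536) + 16*log(2), relation (32.4-6b).
double log_65537(double log2_precomputed)
{
    return log1p_series(1.0 / 65536.0) + 16.0 * log2_precomputed;
}
```

Calling `log_65537(std::log(2.0))` agrees with `std::log(65537.0)` to within rounding error; two terms of the series already suffice for about 14 digits.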
32.5 Arctangent relations for π ‡

We consider relations of the form

$$ k \frac{\pi}{4} = m_1 \arctan\frac{1}{x_1} + m_2 \arctan\frac{1}{x_2} + \ldots + m_n \arctan\frac{1}{x_n} $$ (32.5-1)

where $k, m_1, \ldots, m_n, x_1, \ldots, x_n \in \mathbb{Z}$ (in fact, k = 1 almost always). This is an n-term relation. For
example, a 4-term relation, found 1896 by Størmer [326], is

$$ \frac{\pi}{4} = +44 \arctan\frac{1}{57} + 7 \arctan\frac{1}{239} - 12 \arctan\frac{1}{682} + 24 \arctan\frac{1}{12943} $$ (32.5-2)

We use the following compact notation
m1[x1] +m2[x2] + ... +mn[xn] == k * Pi/4
for relation 32.5-1. For example, Størmer's relation 32.5-2 would be written as
+44[57] +7[239] -12[682] +24[12943] == 1 * Pi/4


+4[5] -1[239] == 1 * Pi/4

+12[18] +8[57] -5[239] == 1 * Pi/4

+44[57] +7[239] -12[682] +24[12943] == 1 * Pi/4

+88[192] +39[239] +100[515] -32[1068] -56[173932] == 1 * Pi/4

+322[577] +76[682] +139[1393] +156[12943] +132[32807] +44[1049433] == 1 * Pi/4

+1587[2852] +295[4193] +593[4246] +359[39307]
+481[55603] +625[211050] -708[390112] == 1 * Pi/4

+2192[5357] +2097[5507] -227[9466] +832[12943]
+537[34522] -2287[39307] -171[106007] -708[1115618] == 1 * Pi/4

+3286[34208] +9852[39307] +5280[41688] +7794[44179]
+7608[60443] +4357[275807] -1484[390112] -1882[619858] +776[976283] == 1 * Pi/4

+1106[54193] -30569[78629] -28687[88733] -13882[173932]
+9127[390112] -9852[478707] -24840[1131527] +4357[3014557]
+21852[5982670] +23407[201229582] == -1 * Pi/4

-36462[390112] +135908[485298] +274509[683982] -39581[1984933]
+178477[2478328] -114569[3449051] -146571[18975991] +61914[22709274]
-69044[24208144] -89431[201229582] -43938[2189376182] == 1 * Pi/4

+893758[1049433] +655711[1264557] +310971[1706203] +503625[1984933]
-192064[2478328] -229138[3449051] -875929[18975991] -616556[21638297]
-187143[22709274] -171857[24208144] -251786[201229582] -432616[2189376182] == 2 * Pi/4

Figure 32.5-A: Best n-term arctan relations currently known for 2 ≤ n ≤ 12.


13: +1126917[3449051] +1337518[4417548] ... -216308[2189376182] == 1 * Pi/4
14: +446879[6826318] +5624457[8082212] ... +483341[17249711432] == 1 * Pi/4
15: +5034126[20942043] +1546003[22709274] ... +1337518[250645741818] == 1 * Pi/4
16: +14215326[53141564] +6973645[54610269] ... +8735690[34840696582] == 1 * Pi/4
17: +12872838[201229582] +27205340[203420807] ... +35839320[134520516108] == 1 * Pi/4
18: +2859494[299252491] -41068896[321390012] ... -89623108[18004873694818] == -1 * Pi/4
19: +270619381[778401733] -138919506[1012047353] ... +146407224[30038155625330] == 1 * Pi/4
20: +807092487[2674664693] +479094776[2701984943] ... +214188292[564340076432] == 1 * Pi/4
21: +598245178[5513160193] -115804626[7622130953] ... -1521437626[38057255532937] == 1 * Pi/4

Figure 32.5-B: The best n-term arctan relations (shortened) currently known for 13 ≤ n ≤ 21.
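
A relation written in the compact notation can be checked numerically. The following sketch is not part of the FXT library; the type `Term` and the function `relation_value` are illustrative names, and double precision is assumed to be enough to recognize the small integer k.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One term m[x] of the compact notation, meaning m * arctan(1/x).
struct Term { long m; double x; };

// Evaluate sum_j m_j * arctan(1/x_j) and return it in units of Pi/4,
// so the result should be very close to the integer k of the relation.
double relation_value(const std::vector<Term>& terms)
{
    const double quarter_pi = std::atan(1.0);
    double sum = 0.0;
    for (const Term& t : terms)
    {
        assert(t.x > 0);
        sum += t.m * std::atan(1.0 / t.x);
    }
    return sum / quarter_pi;
}
```

For Størmer's relation, `relation_value({{44, 57}, {7, 239}, {-12, 682}, {24, 12943}})` is 1 up to rounding error. For relations with very large multipliers and arguments a multiprecision check would be the safer choice.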


We write the relations so that the arguments $x_j$ are strictly increasing. Further, n-term relations are
sorted so that the first arguments $x_1$ are in decreasing order (if $x_1, \ldots, x_j$ coincide with two relations, then
the arguments $x_{j+1}$ are used for sorting). For example, a few 6-term relations are


+322[577] +76[682] +139[1393] +156[12943] +132[32807] +44[1049433] == 1 * Pi/4
+122[319] +61[378] +115[557] +29[1068] +22[3458] +44[27493] == 1 * Pi/4
+100[319] +127[378] +71[557] -15[1068] +66[2943] +44[478707] == 1 * Pi/4
+337[307] -193[463] +151[4193] +305[4246] -122[39307] -83[390112] == 1 * Pi/4
+183[268] +32[682] +95[1568] +44[4662] -166[12943] -51[32807] == 1 * Pi/4


Note that the second and third relation are sorted according to their fifth arguments (3458 and 2943). 
Among all n-term relations we consider a relation better than another if it precedes it. The first one is 
the best relation. Our goal is to find the best n-term relation for n small. For example, the relation 
+322[577] +76[682] +139[1393] +156[12943] +132[32807] +44[1049433] == 1 * Pi/4
is the best (known!) 6-term relation. The best n-term relations for 2 ≤ n ≤ 12 currently known are
shown in figure 32.5-A. Note that k = −1 in the 10-term relation and k = 2 in the 12-term relation. The
best relations for 13 ≤ n ≤ 21 (shortened to save space) are shown in figure 32.5-B. Figure 32.5-C gives
just the first argument ($x_1$) of the best relations for 2 ≤ n ≤ 27.


n-terms  min-arg
 2  5                 Machin (1706)    +4[5] -1[239] == 1 * Pi/4
 3  18                Gauss (YY?)      +12[18] +8[57] -5[239] == 1 * Pi/4
 4  57                Størmer (1896)
 5  192               JJ (1993), prev: Størmer (1896) 172
 6  577               JJ (1993)
 7  2,852             JJ (1993)
 8  5,357             JJ (2006), prev: JJ (1993) 4,246
 9  34,208            JJ (2006), prev: JJ (1993) 12,943, prev: Gauss (Y?) 5,257
10  54,193            JJ (2006), prev: JJ (1993) 51,387
11  390,112           JJ (1993)
12  1,049,433         JJ (2006), prev: JJ (1993) 683,982
13  3,449,051         JJ (2006), prev: JJ (1993) 1,984,933
14  6,826,318         JJ (2006)
15  20,942,043        HCL (1997), prev: MRW (1997) 18,975,991
16  53,141,564        JJ (2006)
17  201,229,582       JJ (2006)
18  299,252,491       JJ (2006)
19  778,401,733       JJ (2006)
20  2,674,664,693     JJ (2006)
21  5,513,160,193     JJ (2006)
22  17,249,711,432    JJ (2006), prev: 16,077,395,443 MRW (27-Jan-2003)
23  58,482,499,557    JJ (2006)
24  102,416,588,812   JJ (2006)
25  160,422,360,532   JJ (2006)
26  392,943,720,343   JJ (2006)
27  970,522,492,753   JJ (2006)

MRW := Michael Roby Wetherfield
HCL := Hwang Chien-Lih
JJ := Joerg Arndt

Figure 32.5-C: First arguments of the best n-term arctan relation known today, for 2 ≤ n ≤ 27.


32.5.1 How to find one relation 


In the 5-term relation
+88[192] +39[239] +100[515] -32[1068] -56[173932] == 1 * Pi/4

factor $x_j^2 + 1$ for all (inverse) arguments $x_j$:

192^2+1 == 36865 == 5 * 73 * 101
239^2+1 == 57122 == 2 * 13^4
515^2+1 == 265226 == 2 * 13 * 101^2
1068^2+1 == 1140625 == 5^6 * 73
173932^2+1 == 30252340625 == 5^5 * 13 * 73 * 101^2

Note that all odd prime factors are the four primes 5, 13, 73, 101. The coefficients $m_j$ can be computed
as follows. Write (for all arguments $x_j$)

$$ x_j^2 + 1 = 2^{e(j,0)} \cdot 5^{e(j,1)} \cdot 13^{e(j,2)} \cdot 73^{e(j,3)} \cdot 101^{e(j,4)} $$ (32.5-3)

Now define a matrix M using the exponents e(j, i) (ignoring the prime 2):

$$ M_{j,i} := \pm e(j,i) $$ (32.5-4)

The sign of $M_{j,i}$ is minus if $(x_j \bmod p_i) < p_i/2$. With our example we find


transpose(M) =
[-5, -1, +1, -2]  // 173932^2+1 == 5^5 *13^1 *73^1 *101^2
[+6,  0, +1,  0]  // 1068^2+1 == 5^6 *73^1
[ 0, +1,  0, -2]  // 515^2+1 == 13^1 *101^2  (*2)
[ 0, -4,  0,  0]  // 239^2+1 == 13^4  (*2)
[-1,  0, +1, +1]  // 192^2+1 == 5^1 *73^1 *101^1
//  5, 13, 73, 101  <--= primes


For the signs of the upper left 3 × 2 sub-matrix, note that (173932 mod 5) = 2 < 5/2, (173932 mod 13) =
5 < 13/2, (1068 mod 5) = 3 > 5/2, and (515 mod 13) = 8 > 13/2. The nullspace of M consists of one
vector:


[-56, -32, 100, 39, 88] 
This tells us that 
+88[192] +39[239] +100[515] -32[1068] -56[173932] == k * Pi/4 


We determine that k = 1 by a floating-point computation of the left side. Quite often one finds a relation 
where k = 0, but we are not interested in those. For example, the candidates 12943, 1068, 682, 538, 239 
lead to factorizations into (2 and) the odd primes 5, 13, 61, 73. The matrix M is 


transpose(M) =
[+4, +3, -1,  0]  // 12943^2+1 == 5^4 *13^3 *61^1  (*2)
[+6,  0,  0, +1]  // 1068^2+1 == 5^6 *73^1
[-3,  0, -2,  0]  // 682^2+1 == 5^3 *61^2
[+1, -1, +1, -1]  // 538^2+1 == 5^1 *13^1 *61^1 *73^1
[ 0, -4,  0,  0]  // 239^2+1 == 13^4  (*2)
//  5, 13, 61, 73  <--= primes
The nullspace of M is [1, -1, -1, -1, 1] and the relation is 
+1[239] -1[538] -1[682] -1[1068] +1[12943] == 0 
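
The construction of the signed exponents and the nullspace condition can be sketched as follows. This is an illustration only: `signed_exponent` and `in_nullspace` are hypothetical names, trial division is used for the factorization, and the code only verifies a given coefficient vector instead of computing the nullspace.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Signed exponent of the odd prime p in x^2+1, with the sign rule of the
// text: minus if (x mod p) < p/2, else plus. Returns 0 if p does not divide.
long signed_exponent(std::uint64_t x, std::uint64_t p)
{
    assert(p > 2 && x < (1ULL << 31));  // keeps x*x+1 within 64 bits
    std::uint64_t v = x * x + 1;
    long e = 0;
    while (v % p == 0) { v /= p; ++e; }
    return (2 * (x % p) < p) ? -e : e;
}

// Check sum_j m[j] * M[j][i] == 0 for every prime, i.e. m is in the nullspace.
bool in_nullspace(const std::vector<std::uint64_t>& xs,
                  const std::vector<long>& m,
                  const std::vector<std::uint64_t>& primes)
{
    assert(xs.size() == m.size());
    for (std::uint64_t p : primes)
    {
        long s = 0;
        for (std::size_t j = 0; j < xs.size(); ++j)
            s += m[j] * signed_exponent(xs[j], p);
        if (s != 0) return false;
    }
    return true;
}
```

With the arguments 173932, 1068, 515, 239, 192 and the primes 5, 13, 73, 101 the vector [-56, -32, 100, 39, 88] passes the test, in agreement with the 5-term relation above.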


32.5.2 Searching for sets of candidate arguments 


A set of candidate arguments $x_j$ will give a relation only if the $x_j^2 + 1$ factor into a common set of primes.
Apart from the factor 2, all prime factors are of the form 4i + 1. One can choose a subset of those primes
$S := \{p_1, \ldots, p_u\}$ and test which of the products $P = 2^{e_0} \cdot p_1^{e_1} \cdots p_u^{e_u}$ are of the form $P = x^2 + 1$. The test
is to determine whether P − 1 is a perfect square. The GP function issquare() does this in an efficient
way (as described in [110]). A recursive implementation of the search is


\\ global variables:
ct=0;  \\ count solutions
av=vector(1000);  \\ vector containing solutions
\\ pv = [...];  \\ vector of primes of the form 4*i+1
m=10^20;  \\ search max := sqrt(m)

check(t)=
{
    local(a);
    if ( issquare(t-1, &a), ct++; av[ct] = a; );
    if ( issquare(t+t-1, &a), ct++; av[ct] = a; );
}

gen_rec(d, p)=
{
    local(g, gg, t);
    if ( d>length(pv), return() );
    g = pv[d];
    gg = 1;
    while ( 1,
        t = p * gg;
        if ( t>m, return() );
        if ( gg!=1, check(t) );
        gen_rec(d+1, t);
        gg *= g;
    );
    return();
}

We do the search using the four primes 5, 13, 61, and 73:

pv=[5, 13, 61, 73];  \\ vector of primes
gen_rec(1, 1);  \\ do the search


The candidates found are 
12943, 1068, 682, 538, 239, 57, 27, 18, 11, 8, 7, 5, 3, 2 


The following relations are found: 


+1[239] -1[538] -1[682] -1[1068] +1[12943] == NULL
+44[57] +7[538] -5[682] +7[1068] +17[12943] == 1 * Pi/4 (5-term)
+1[27] +42[57] +6[538] -5[682] +7[1068] +16[12943] == 1 * Pi/4 (6-term)
+1[18] +41[57] +6[538] -5[682] +6[1068] +16[12943] == 1 * Pi/4 (6-term)
+1[11] +39[57] +6[538] -5[682] +6[1068] +15[12943] == 1 * Pi/4 (6-term)
--snip--
+1[3] +26[57] +4[538] -3[682] +4[1068] +10[12943] == 1 * Pi/4 (6-term)
+1[2] +18[57] +3[538] -2[682] +3[1068] +7[12943] == 1 * Pi/4 (6-term)


The search is reasonably fast for up to about 12 primes. However, one needs to guess which prime set 
may lead to a good arctan relation. The particular set of primes 


{5, 13, 17, 29, 37, 53, 61, 89, 97, 101} (32.5-5) 


led me (1993) to the relation 
-36462[390112] +135908[485298] +274509[683982] -39581[1984933]
+178477[2478328] -114569[3449051] -146571[18975991] +61914[22709274]
-69044[24208144] -89431[201229582] -43938[2189376182] == 1 * Pi/4
which is still the best 11-term relation known today. 


The April-2006 computations were done with a more exhaustive search described in the next section. 


32.5.3 Exhaustive search for sets of candidate arguments 


We want to find all x where $x^2 + 1$ factors into (2 and) the first 64 primes of the form 4i + 1 (S =
{5, 13, 17, 29, ..., 761}). Call the resulting set of candidates A. We will later try (for small n) all
(n − 1)-subsets of S and test whether the corresponding subset of A leads to an arctan relation.


The simplest approach is to factor (for x up to a practical maximum) all $x^2 + 1$ and add x to the set A if
all odd prime factors of $x^2 + 1$ are in S. This method, however, is rather slow: about 11,000 CPU cycles
are needed for each test.


A much faster approach is the following sieving method. We can determine x such that a given prime p
divides $x^2 + 1$ by solving $x^2 \equiv -1 \pmod{p}$ as shown in section 39.9 on page 784. We can further solve
$x^2 \equiv -1 \pmod{p^h}$ for all h as shown in the cited section. Initialize an array with the value 1 for even
indices, else with 2 ($x^2 + 1$ is even if and only if x is odd). For each prime $p \in S$ do, for all powers $p^h$, as
follows: multiply the array entries with indices $s, s + p^h, s + 2p^h, s + 3p^h, \ldots$ where $s^2 \equiv -1 \pmod{p^h}$
by p. Finally find the entries with index x that are equal to $x^2 + 1$, these are the candidates.
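
The sieve with multiplication can be sketched for a small range. This is not the routine used for the search: `sieve_candidates` is a name chosen for this example, and the roots of $s^2 \equiv -1 \pmod{p^h}$ are found by brute force here instead of the method of the cited section, which is only acceptable for a small limit N.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return all x in [0, N] such that x^2+1 factors into 2 and the primes in S.
std::vector<std::uint64_t>
sieve_candidates(std::uint64_t N, const std::vector<std::uint64_t>& S)
{
    assert(N < (1ULL << 31));              // x^2+1 must fit into 64 bits
    std::vector<std::uint64_t> a(N + 1);
    for (std::uint64_t x = 0; x <= N; ++x) a[x] = (x & 1) ? 2 : 1;

    const std::uint64_t top = N * N + 1;   // largest value of x^2+1
    for (std::uint64_t p : S)
    {
        for (std::uint64_t q = p; ; q *= p)   // q = p^h
        {
            // Brute-force search for roots s of s^2 == -1 (mod q), s <= N.
            const std::uint64_t lim = (q - 1 < N) ? q - 1 : N;
            for (std::uint64_t s = 0; s <= lim; ++s)
            {
                if ((s * s + 1) % q != 0) continue;
                for (std::uint64_t x = s; x <= N; x += q) a[x] *= p;
            }
            if (q > top / p) break;           // next power would exceed x^2+1
        }
    }

    std::vector<std::uint64_t> found;
    for (std::uint64_t x = 0; x <= N; ++x)
        if (a[x] == x * x + 1) found.push_back(x);
    return found;
}
```

With N = 20000 and S = {5, 13, 61, 73} the result contains 0 and 1 followed by the candidates 2, 3, 5, 7, 8, 11, 18, 27, 57, 239, 538, 682, 1068, 12943 listed in the previous section.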


We can use the logarithm of a prime and add it instead of multiplying by the prime, then we need to test
whether entry x is (approximately) equal to $\log(x^2 + 1)$.


The array can be avoided altogether by using priority queues (see the section on page 162). An event
scheduled for index x corresponding to a prime power $p^h$ will trigger addition of log(p) to bucket x. The
event must then be rescheduled to $x + p^h$.


Almost all computations of the logarithm can be avoided by observing that both $x^2 + 1$ and the logarithm
are strictly increasing functions. We call a number x so that $x^2 + 1$ has all odd prime factors in S a
candidate. The sum of logarithms (of primes) for a candidate x is equal to $\log(x^2 + 1)$. If a was the last
candidate, then for the next candidate b the sum of logarithms must be strictly greater than that for a.
Therefore we only need to compute $\log(x^2 + 1)$ if a new sum of logarithms is greater than the one for
the candidate found most recently. It turns out that a logarithm is computed exactly whenever a new
candidate is found.
The search costs about 250 cycles per test, which is a good improvement over the first attempt. Analysis
of the machine code shows that most of the time is spent in the reschedule operations. 


The final improvement comes from the separation of the frequent events (small prime powers) from the 
rare events (big prime powers). Again we need an array, but only a small one that fits into level-1 cache. 
A segmented search has to be used. 


Now we need to find the threshold beyond which an event is considered rare. Very surprisingly, it turns 
out that the search is fastest if all events are considered frequent! This means that we can forget about 
the priority queues. A better suited algorithm (and implementation) for a priority queue might give 
different results. 


The resulting routine is remarkably fast: it uses just slightly more than 11 cycles per test. It was used
to determine all candidates $x < 10^{14}$. The search took about eight days. The last entries in the list of
candidates are


99205431802196^2+1 == [13.29.37.53.89.157.241.257.337.373.401^2.761]
99238108604548^2+1 == [5.29.37.61^2.101.349.397^2.433.557^2.661]
99311314035643^2+1 == [2.5^2.13.29.73.113.233.241.269.281.293.317.349.461]
99395767528881^2+1 == [2.13.29.37.53.149.173.181.193.313.353.373.401.449]
99501239756693^2+1 == [2.5^4.13.29.37^2.61.233.277.313^3.317.401]
99627378461772^2+1 == [5.13^2.37.41.73^2.137.277.281.521.557.617.761]
99759820688082^2+1 == [5^2.17.29.37.109.181.257.269.337.389.409.457.653]
99849755159917^2+1 == [2.5.89.101.181.233.257.293.389.457.521.557.677]
99950583525307^2+1 == [2.5^3.13.173.181.193.241^3.257.457^2.677]
99955223464153^2+1 == [2.5.13.61^2.101^2.109.373.421.433.509.709.757]


The search produced 43,936 candidates (including 0 and 1). Exactly that many logarithms were computed.
This means that on average one logarithm was computed for one in $10^{14}/43{,}936 > 2 \cdot 10^9$ values
tested.


We can extend the list by testing (for each element x found so far) whether $x + d$ or $x + (x^2 + 1)/d$ are
new candidates:

[x] == [x+d] + [x+(x^2+1)/d] where d divides x^2+1

Additionally we can try the arguments on the right side of relations like

[x] == 2[2*x] - [4*x^3+3*x]
[x] == [2*x-1] + [2*x+1] - [2*x^3+x]
[x] == 3[3*x] - [(9*x^3+7*x)/2] - [(27*x^3+9*x)/2]
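
These identities are easy to check numerically. The sketch below is an illustration with made-up names (`bracket`, `split_residual`, `double_residual`, `triple_residual`); it evaluates the difference of both sides in double precision, which should vanish up to rounding error.

```cpp
#include <cassert>
#include <cmath>

// [x] := arctan(1/x)
double bracket(double x) { return std::atan(1.0 / x); }

// Residual of [x] == [x+d] + [x+(x^2+1)/d].
// For integer arguments d should divide x^2+1.
double split_residual(double x, double d)
{
    assert(x > 0 && d > 0);
    return bracket(x) - (bracket(x + d) + bracket(x + (x * x + 1.0) / d));
}

// Residual of [x] == 2[2*x] - [4*x^3+3*x]
double double_residual(double x)
{
    return bracket(x) - (2.0 * bracket(2.0 * x) - bracket(4.0 * x * x * x + 3.0 * x));
}

// Residual of [x] == 3[3*x] - [(9*x^3+7*x)/2] - [(27*x^3+9*x)/2]
double triple_residual(double x)
{
    const double c = x * x * x;
    return bracket(x) - (3.0 * bracket(3.0 * x)
                         - bracket((9.0 * c + 7.0 * x) / 2.0)
                         - bracket((27.0 * c + 9.0 * x) / 2.0));
}
```

For example, x = 239 and d = 13 (a divisor of 239^2+1 = 2 * 13^4) give the new arguments 252 and 4633.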


Michael Roby Wetherfield has developed a more sophisticated approach for extending the list and sent
me a big set of candidates beyond $10^{14}$. His methods are described in [353] (see also [354], [331], [43]).
We note that a single value, x = 276,914,859,479,857,813,947 where

x^2+1 = [2.5.13.17.29^3.41.53^2.73^2.101.157.181.229.241.313.397.401.509.577]

was discarded because it is greater than $2^{64}$ = 18,446,744,073,709,551,616.

We note the curious relation

$$ [k a] = [(k+1) a] + [(k+1) k a] - [(k^4 + 2 k^3 + k^2) a^3 + (k^2 + k + 1) a] $$ (32.5-6)

Set $f(a, k) := (k^4 + 2 k^3 + k^2) a^3 + (k^2 + k + 1) a$, then we have

$$ f(a,k)^2 + 1 = \left((k a)^2 + 1\right) \left(((k+1) a)^2 + 1\right) \left(((k+1) k a)^2 + 1\right) $$ (32.5-7)


32.5.4 Searching for all n-term relations 


To find all n-term relations whose arguments are a subset of our just determined list of candidates, we
have to test all subsets of (n − 1) (out of 64) odd primes, select the corresponding values x, and compute
the nullspace as described. Let $A_j$ be the j-th candidate. An array M of 64-bit auxiliary values is used.
Its j-th entry $M_j$ is a bit-mask corresponding to the odd primes in the factorization of $A_j^2 + 1$: bit i of
$M_j$ is set if the i-th odd prime divides $A_j^2 + 1$.


To find n-term relations, we must try all $\binom{64}{n-1}$ subsets of size n − 1 out of the 64 odd primes in our
scope. The bit-combination routine from section 1.24 on page 62 was used for this task. The selection
of the entries that factor completely in the subset of n − 1 primes under consideration can be done with
a single bit-AND and a branch. The candidates with more than n − 1 odd primes in their factorization
should be discarded before the search.
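
The mask test can be sketched as follows. The names `prime_mask` and `fits_subset` are assumptions of this example, the mask is built by trial division instead of being recorded during the sieve, and the candidate is assumed to be already known to factor completely into the given primes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bit i of the result is set if primes[i] divides x^2+1.
// 'primes' holds the odd primes in scope (at most 64 of them).
std::uint64_t prime_mask(std::uint64_t x, const std::vector<std::uint64_t>& primes)
{
    assert(primes.size() <= 64 && x < (1ULL << 31));
    const std::uint64_t v = x * x + 1;
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < primes.size(); ++i)
        if (v % primes[i] == 0) mask |= (std::uint64_t(1) << i);
    return mask;
}

// A candidate can be used with a subset of primes (given as a bit-set)
// if and only if its mask has no bit outside the subset:
// one AND (with the complement) and one branch.
bool fits_subset(std::uint64_t mask, std::uint64_t subset)
{
    return (mask & ~subset) == 0;
}
```

With the primes 5, 13, 17, 29 the candidate 57 has the mask 3 (primes 5 and 13) and the candidate 239 has the mask 2 (prime 13 only), so 239 fits the subset {13} while 57 does not.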


While the search is very fast for small n, it does not finish in reasonable time for n > 8. A considerable
speedup is obtained by splitting our N = 64 odd primes into a group of the 20 smallest and b = 64 − 20 =
44 'big' primes. Write (q = n − 1 and)

$$ \binom{N}{q} = \binom{b}{0}\binom{N-b}{q} + \binom{b}{1}\binom{N-b}{q-1} + \ldots + \binom{b}{q}\binom{N-b}{0} $$ (32.5-8a)
$$ = \sum_{j=0}^{q} \binom{b}{j}\binom{N-b}{q-j} $$ (32.5-8b)

This means, we first select the j = 0, 1, 2, ...-subsets of the big primes. We copy the corresponding
candidates whose big prime factors are in the current subset into a new array B. The size of B will be
significantly smaller than the size of A. From this array we select the arguments according to subsets
of the small primes (leaving the subset of big primes fixed). This results in a much improved memory
locality and accelerates the search by a factor of about 25.


 n  prime set of best relation
 2  {13}
 3  {5, 13}
 4  {5, 13, 61}
 5  {5, 13, 73, 101}
 6  {5, 13, 61, 89, 197}
 7  {5, 13, 17, 29, 97, 433}
 8  {5, 13, 29, 37, 61, 97, 337}
 9  {5, 13, 17, 29, 41, 53, 97, 269}
10  {5, 13, 17, 41, 53, 73, 97, 101, 157}
11  {5, 13, 17, 29, 37, 53, 61, 89, 97, 101}
12  {5, 13, 17, 29, 37, 53, 61, 89, 97, 101, 197}
13  {5, 13, 17, 29, 37, 53, 61, 89, 97, 101, 181, 281}
14  {5, 13, 17, 29, 37, 53, 61, 89, 97, 101, 181, 269, 457}
15  {5, 13, 17, 29, 37, 41, 53, 61, 89, 97, 101, 181, 337, 389}

Figure 32.5-D: Primes with the best n-term relations known.


Still, the limit for n so that an exhaustive search can be done has only been moved a little. But if we look
at the prime sets that lead to the best relations, shown in figure 32.5-D, we observe that small primes
are much 'preferred'.


The data suggests that the best possible relation is found long before the search space is exhausted. 
Therefore we stop after the number of big primes in the subset is greater than, say, 4. Both parameters, 
the number b of primes considered big and the maximum number of primes taken from that set, should 
be chosen depending on n. 


Another important improvement is to discard small candidates before the search. This spares us a huge
number of uninteresting relations with small first arguments $x_1$. Obviously, the amount of nullspace
computation is also reduced significantly.


The results of the searches can be found in [20]. While the searches for the n-term relations with n > 11 
did not even exhaust the table of candidates (which in turn is incomplete!), we can be reasonably sure 
that we found the best relations within our scope (of the first 64 odd primes 4i + 1). Indeed I do not 
expect to see a better relation for any n < 15. 


To improve on the results, one may use the first 128 odd primes 4i + 1, sieve up to 10! (distributed on 
100 machines) and a 3-phase subset selection instead of the described 2-phase selection. The selection 
(a nullspace computation) stage should also be done in a distributed fashion to reasonably exhaust the 
table of candidates. Such a computation will likely improve on some of the relations with more than 17 
terms and produce up to 35-term relations that are in the vicinity of the best possible. 
A method for the simultaneous computation of logarithms of small primes that uses a similar method to
the one given here is described in section 32.4 on page 632.

32.5.5 Checking pairs


+12[18] +8[57] -5[239] == 1 * Pi/4 
+8[10] -1[239] -4[515] == 1 * Pi/4 
+44[57] +7[239] -12[682] +24[12943] == 1 * Pi/4 
+20[57] +24[68] +12[117] -5[239] == 1 * Pi/4 
+44[57] +7[239] -12[682] +24[12943] == 1 * Pi/4 
+24[53] +20[57] -5[239] +12[4443] == 1 * Pi/4 
+68[99] +27[239] -4[307] -12[12238] -24[58911] == 1 * Pi/4 
-56[99] +39[239] +20[307] -24[2332] +12[6948] == 1 * Pi/4 
+68[99] +27[239] -4[307] -12[12238] -24[58911] == 1 * Pi/4 
+44[99] +51[239] +44[307] -12[682] +24[12943] == 1 * Pi/4 
+44[99] +51[239] +44[307] -12[682] +24[12943] == 1 * Pi/4 
+56[99] +27[239] +20[307] +24[568] -12[19703] == 1 * Pi/4 
+122[319] +61[378] +115[557] +29[1068] +22[3458] +44[27493] == 1 * Pi/4
+100[319] +127[378] +71[557] -15[1068] +66[2943] +44[478707] == 1 * Pi/4
+39[239] +188[307] +32[2332] -44[6948] +112[32318] -56[55368] == 1 * Pi/4
+95[239] +132[307] -136[2332] +68[6948] +56[12238] +112[58911] == 1 * Pi/4
+95[239] +132[307] -34[682] +90[12943] +22[34522] +22[106007] == 1 * Pi/4
+139[239] +88[307] -56[682] -44[5357] +68[12943] +88[39307] == 1 * Pi/4
+27[239] +132[307] +80[568] +112[1123] +44[19703] -56[160590] == 1 * Pi/4
+83[239] +76[307] +80[568] +56[1113] -12[19703] +56[4180652] == 1 * Pi/4
+776[4193] +593[4246] +2212[5701] +481[34208] +1321[39307] \
+962[44179] +1106[219602] -708[390112] == 1 * Pi/4
-330[4193] +1699[4246] +2212[5648] +1587[34208] +215[39307] \ 
-144[44179] +1106[48737] +398[390112] == 1 * Pi/4 
+625[4052] +295[4193] +1555[4246] +1587[9210] +481[37107] \ 
+359[39307] +962[299655] -1189[390112] == 1 * Pi/4 
+1106[4052] +776[4193] +593[4246] +1106[9210] +481[34208] \ 
+1321[39307] +962[44179] -708[390112] == 1 * Pi/4 
+6056[10842] +4062[34208] +3796[39307] +962[44179] +776[139693] \ 
-2475[275807] -1484[390112] -1882[619858] -776[201229582] == 1 * Pi/4 
+5280[10842] +4838[34208] +776[38280] +4572[39307] +1738[44179] \ 
-3251[275807] -708[390112] -2658[619858] -1552[1460857] == 1 * Pi/4 
+6056[10842] +4062[34208] +3796[39307] +962[44179] +776[139693] -2475[275807] \ 
-1484[390112] -1882[619858] -776[201229582] == 1 * Pi/4 
+5280[10842] +4838[34208] +776[38280] +4572[39307] +1738[44179] -3251[275807] \ 
-708[390112] -2658[619858] -1552[1460857] == 1 * Pi/4 


Figure 32.5-E: Checking pairs of arctan relations for the computation of π.


When computing π via arctan relations one should make reasonably sure that no error occurred. To
minimize the extra work, a checking pair of relations should be used. The checking pair


+12[49] +32[57] -5[239] +12[110443] == 1 * Pi/4
+44[57] +7[239] -12[682] +24[12943] == 1 * Pi/4


is given in [32]. The values arctan(1/57) and arctan(1/239) occur in both relations. Figure 32.5-E shows
some checking pairs where the differing terms tend to be rapidly convergent.


In a checking pair the multipliers must be different for a shared argument of the arctan; the following
two relations are not a checking pair:


+56[99] +27[239] +32[307] +12[4193] -12[39307] == 1 * Pi/4 
+56[99] +39[239] +20[307] -24[2332] +12[6948] == 1 * Pi/4 


The term arctan(1/99) has the same multiplier 56 in both relations, so an error in the computation of 
this term would go undetected. 
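
The condition on the multipliers is simple to test mechanically. A minimal sketch, assuming a relation is stored as a map from argument to multiplier (the type alias `Relation` and the function `is_checking_pair` are names invented for this example):

```cpp
#include <cassert>
#include <map>

// A relation as a map: argument x -> multiplier m, for the terms m[x].
using Relation = std::map<long, long>;

// True if every argument shared by both relations has different
// multipliers in the two relations.
bool is_checking_pair(const Relation& r1, const Relation& r2)
{
    for (const auto& term : r1)
    {
        const auto it = r2.find(term.first);
        if (it != r2.end() && it->second == term.second) return false;
    }
    return true;
}
```

The pair from [32] passes the test, while the two relations sharing the term +56[99] are rejected. The test says nothing about whether each relation is valid; that has to be checked separately.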
Chapter 33


Computing the elementary functions 
with limited resources 


This chapter presents two types of algorithms for computations with limited resources, the shift-and-add 
and the CORDIC algorithms. The algorithms allow the computation of elementary functions such as 
the logarithm, exponential function, sine, cosine and their inverses with only shifts, adds, comparisons 
and table lookups. Some early floating-point units (FPUs) used CORDIC algorithms and your pocket 
calculator surely does. 


33.1 Shift-and-add algorithms for $\log_b(x)$ and $b^x$


Shift-and-add algorithms use only additions, multiplications by a power of 2 (‘shifts’), and comparisons. 
A precomputed lookup table with as many entries as the desired accuracy in bits is required. The 
algorithms are especially useful with limited hardware capabilities. 


The implementations given in this section use floating-point numbers. They can be rewritten to use
scaled integer arithmetic without difficulty.


33.1.1 Computing the base-b logarithm 


We will use a table that contains the values $A_k = \log_b(1 + 2^{-k})$; it is created as follows:

double *shiftadd_ltab;  // element [0] unused
ulong ltab_n;

void
make_shiftadd_ltab(double b)
{
    double l1b = 1.0 / log(b);
    double s = 1.0;
    for (ulong k=0; k<ltab_n; ++k)
    {
        shiftadd_ltab[k] = log(1.0+s) * l1b;  // == log_b(1+1/2^k)
        s *= 0.5;
    }
}

The algorithm takes as input the argument x ≥ 1 and the number of iterations n and computes $\log_b(x)$:
1. Initialize: set $t_0 = 0$, $e_0 = 1$, and k = 1.
2. Compute $u_k = e_k \cdot (1 + 2^{-k})$. If $u_k \le x$ then set $d_k = 1$, else set $d_k = 0$.
3. If $d_k \ne 0$, then set $t_{k+1} = t_k + A_k$ and $e_{k+1} = u_k$ and repeat the last step. Else set $t_{k+1} = t_k$ and
$e_{k+1} = e_k$.
4. Increment k. If k = n return $t_k$, else goto step 2.


A C++ implementation is given in [FXT: arith/shiftadd-log-demo.cc]:

double
shiftadd_log(double x, ulong n)
{
    if ( n>=ltab_n )  n = ltab_n;
    double t = 0.0;
    double e = 1.0;
    double v = 1.0;
    // [PRINT]
    for (ulong k=1; k<n; ++k)
    {
        v *= 0.5;  // v == (1>>k)
        double u;
        bool d;
        while ( 1 )
        {
            u = e + e * v;  // u=e; u+=(e>>k);
            d = ( u<=x );
            // [PRINT]
            if ( d==false )  break;
            t += shiftadd_ltab[k];
            e = u;
        }
    }
    return t;
}

The variable v is a power of 1/2, therefore all multiplications by it can, with scaled integer arithmetic,
be replaced by shifts as indicated in the comments. The values for the first steps of the computation for
the argument $x = \sqrt{2}$ are given in figure 33.1-A. The columns of the figure correspond to the variables
u (= $u_k$), t (= $t_k$), e (= $e_k$), and shiftadd_ltab[k] (= $A_k$).


   k:   u_k        | t_k        | e_k         | A_k
init:   -          | 0.00000000 | +1.00000000 | +1.00000000
   1:   1.50000000 | 0.00000000 | +1.00000000 | +0.58496250
   2:   1.25000000 | 0.00000000 | +1.00000000 | +0.32192809
   2:   1.56250000 | 0.32192809 | +1.25000000 | +0.32192809
   3:   1.40625000 | 0.32192809 | +1.25000000 | +0.16992500
   3:   1.58203125 | 0.49185309 | +1.40625000 | +0.16992500
   4:   1.49414062 | 0.49185309 | +1.40625000 | +0.08746284
   5:   1.45019531 | 0.49185309 | +1.40625000 | +0.04439411
   6:   1.42822265 | 0.49185309 | +1.40625000 | +0.02236781
   7:   1.41723632 | 0.49185309 | +1.40625000 | +0.01122725
   8:   1.41174316 | 0.49185309 | +1.40625000 | +0.00562454
   8:   1.41725778 | 0.49747764 | +1.41174316 | +0.00562454
   9:   1.41450047 | 0.49747764 | +1.41174316 | +0.00281501
  10:   1.41312181 | 0.49747764 | +1.41174316 | +0.00140819
  10:   1.41450182 | 0.49888583 | +1.41312181 | +0.00140819
  11:   1.41381182 | 0.49888583 | +1.41312181 | +0.00070426
  11:   1.41450215 | 0.49959010 | +1.41381182 | +0.00070426
  12:   1.41415698 | 0.49959010 | +1.41381182 | +0.00035217
  12:   1.41450224 | 0.49994228 | +1.41415698 | +0.00035217
  13:   1.41432961 | 0.49994228 | +1.41415698 | +0.00017609
  14:   1.41424330 | 0.49994228 | +1.41415698 | +0.00008805
  15:   1.41420014 | 0.49994228 | +1.41415698 | +0.00004402
  15:   1.41424330 | 0.49998631 | +1.41420014 | +0.00004402
   ∞:   1.41421356 | 0.50000000 | +1.41421356 | +0.00000000
        = x        | = log_2(√2) | = x        | = 0

Figure 33.1-A: Numerical values occurring in the shift-and-add computation of $\log_2(\sqrt{2}) = 1/2$. The
computation of $\log_{1/2}(\sqrt{2}) = -1/2$ corresponds to the same values but opposite signs for all entries $A_k$
and $t_k$. Note that certain steps are repeated (for k = 2, 3, 8, 10, 11, 12, 15).
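
For experimenting without the FXT library, the table and the loop can be combined into a self-contained sketch. The names `make_table` and `shiftadd_log_sketch` are not from FXT. The cap of 50 table entries is an assumption of this sketch: for much smaller v the sum e + e*v no longer differs from e in double precision, and the inner loop would then not terminate.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Table A_k = log_b(1 + 2^-k) for k = 0 .. n-1 (entry 0 is unused below).
std::vector<double> make_table(double b, unsigned n)
{
    assert(b > 0.0 && b != 1.0);
    std::vector<double> tab(n);
    double s = 1.0;
    for (unsigned k = 0; k < n; ++k)
    {
        tab[k] = std::log1p(s) / std::log(b);
        s *= 0.5;
    }
    return tab;
}

// log_b(x) for x >= 1, using tab.size()-1 iterations.
double shiftadd_log_sketch(double x, const std::vector<double>& tab)
{
    assert(x >= 1.0 && tab.size() <= 50);
    double t = 0.0, e = 1.0, v = 1.0;
    for (std::size_t k = 1; k < tab.size(); ++k)
    {
        v *= 0.5;  // v == 2^-k
        // Repeat the step for this k as long as e*(1+2^-k) does not exceed x.
        for (double u = e + e * v; u <= x; u = e + e * v)
        {
            t += tab[k];
            e = u;
        }
    }
    return t;
}
```

With `tab = make_table(2.0, 50)` the call `shiftadd_log_sketch(8.0, tab)` gives 3 up to rounding error, following the steps shown in figure 33.1-B.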


The algorithm has been adapted from [256] (chapter 5) where the correction is made only once for each
value $A_k$, limiting the range of convergence to $x < X$ where

$$X = \prod_{k=0}^{\infty} \left(1 + \frac{1}{2^k}\right) = 4.768462058062743448299798577356794477543\ldots \tag{33.1-1}$$

As given, the algorithm converges for any $x > 0$, $x \neq 1$. A numerical example for the argument x = 8 is
given in figure 33.1-B. The base b must satisfy $b > 0$ and $b \neq 1$.

  k:   u_k        | t_k        | e_k         | A_k
 init | -         | 0.00000000 | +1.00000000 | +1.00000000
  1:  1.50000000 | 0.00000000 | +1.00000000 | +0.58496250
  1:  2.25000000 | 0.58496250 | +1.50000000 | +0.58496250
  1:  3.37500000 | 1.16992500 | +2.25000000 | +0.58496250
  1:  5.06250000 | 1.75488750 | +3.37500000 | +0.58496250
  1:  7.59375000 | 2.33985000 | +5.06250000 | +0.58496250
  1:  11.3906250 | 2.92481250 | +7.59375000 | +0.58496250
  2:  9.49218750 | 2.92481250 | +7.59375000 | +0.32192809
  3:  8.54296875 | 2.92481250 | +7.59375000 | +0.16992500
  4:  8.06835937 | 2.92481250 | +7.59375000 | +0.08746284
  5:  7.83105468 | 2.92481250 | +7.59375000 | +0.04439411
  5:  8.07577514 | 2.96920662 | +7.83105468 | +0.04439411
  6:  7.95341491 | 2.96920662 | +7.83105468 | +0.02236781
  6:  8.07768702 | 2.99157443 | +7.95341491 | +0.02236781
  ∞:  8.00000000 | 2.99999999 | +8.00000000 | +0.00000000
      = x        | = log_2(8) | = x         | = 0

Figure 33.1-B: Values occurring in the first few steps of a shift-and-add computation of $\log_2(8) = 3$.
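The loop above can be tried without the FXT library. The following is a minimal, self-contained sketch, not the library code: the name shiftadd_log_sketch and the on-the-fly table are assumptions made for illustration, and the table entries $A_k = \log_b(1+2^{-k})$ are inferred from the values in the figures.

```cpp
#include <cmath>
#include <vector>

// Sketch only: returns an approximation of log_b(x) for x >= 1.
// The table A_k = log_b(1 + 2^-k) is built with std::log for convenience;
// a resource-limited implementation would store it precomputed.
double shiftadd_log_sketch(double x, double b, unsigned n)
{
    std::vector<double> ltab(n, 0.0);
    double s = 1.0;
    for (unsigned k = 1; k < n; ++k)
    {
        s *= 0.5;                                   // s == 2^-k
        ltab[k] = std::log(1.0 + s) / std::log(b);  // A_k
    }

    double t = 0.0;  // accumulated logarithm
    double e = 1.0;  // product of the factors (1 + 2^-k) used so far
    double v = 1.0;
    for (unsigned k = 1; k < n; ++k)
    {
        v *= 0.5;                                   // v == 2^-k
        while (true)                                // a step may be repeated
        {
            const double u = e + e * v;             // u = e * (1 + 2^-k)
            if (!(u <= x))  break;
            t += ltab[k];
            e = u;
        }
    }
    return t;
}
```

With b = 2 and n = 40 the result for x = 8 should agree with 3 to roughly eleven decimal places, because after step k the ratio x/e is below $1+2^{-k}$.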

33.1.2 Computing $b^x$

We can use the same precomputed table as with the computation of $\log_b(x)$.

The algorithm takes as input the argument x and the number of iterations n and computes $b^x$ for $b > 1$,
$x \in \mathbb{R}$. It proceeds as follows:

1. Initialize: set $t_0 = 0$, $e_0 = 1$, and $k = 1$.

2. Compute $u_k = t_k + A_k$. If $u_k \le x$ then set $d_k = 1$, else set $d_k = 0$.

3. If $d_k \neq 0$, then set $t_{k+1} = u_k$ and $e_{k+1} = e_k\,(1 + 2^{-k})$ and repeat the last step. Else set $t_{k+1} = t_k$
   and $e_{k+1} = e_k$.

4. Increment k. If $k = n$ return $e_k$, else goto step 2.


A C++ implementation is given in [FXT: arith/shiftadd-exp-demo.cc]:

  double
  shiftadd_exp(double x, ulong n)
  {
      if ( n>=ltab_n )  n = ltab_n;
      double t = 0.0;
      double e = 1.0;
      double v = 1.0;
      // [PRINT]
      for (ulong k=1; k<n; ++k)
      {
          v *= 0.5;  // v == (1>>k)
          double u;
          bool d;
          while ( 1 )
          {
              u = t + shiftadd_ltab[k];
              d = ( u<=x );
              // [PRINT]
              if ( d==false )  break;
              t = u;
              e += e * v;  // e+=(e>>k);
          }
      }
      return e;
  }

  k:   u_k        | t_k        | e_k         | A_k
 init | 0.00000000 | 0.00000000 | +1.00000000 | +0.00000000
  1: | 0.58496250 | 0.00000000 | +1.00000000 | +0.58496250
  2: | 0.32192809 | 0.00000000 | +1.00000000 | +0.32192809
  2: | 0.64385618 | 0.32192809 | +1.25000000 | +0.32192809
  3: | 0.49185309 | 0.32192809 | +1.25000000 | +0.16992500
  3: | 0.66177809 | 0.49185309 | +1.40625000 | +0.16992500
  4: | 0.57931593 | 0.49185309 | +1.40625000 | +0.08746284
  5: | 0.53624721 | 0.49185309 | +1.40625000 | +0.04439411
  6: | 0.51422090 | 0.49185309 | +1.40625000 | +0.02236781
  7: | 0.50308035 | 0.49185309 | +1.40625000 | +0.01122725
  8: | 0.49747764 | 0.49185309 | +1.40625000 | +0.00562454
  8: | 0.50310219 | 0.49747764 | +1.41174316 | +0.00562454
  9: | 0.50029266 | 0.49747764 | +1.41174316 | +0.00281501
 10: | 0.49888583 | 0.49747764 | +1.41174316 | +0.00140819
 10: | 0.50029403 | 0.49888583 | +1.41312181 | +0.00140819
 11: | 0.49959010 | 0.49888583 | +1.41312181 | +0.00070426
 11: | 0.50029437 | 0.49959010 | +1.41381182 | +0.00070426
 12: | 0.49994228 | 0.49959010 | +1.41381182 | +0.00035217
 12: | 0.50029446 | 0.49994228 | +1.41415698 | +0.00035217
 13: | 0.50011838 | 0.49994228 | +1.41415698 | +0.00017609
 14: | 0.50003033 | 0.49994228 | +1.41415698 | +0.00008805
 15: | 0.49998631 | 0.49994228 | +1.41415698 | +0.00004402
 15: | 0.50003034 | 0.49998631 | +1.41420014 | +0.00004402
  ∞: | 0.50000000 | 0.50000000 | +1.41421356 | +0.00000000
       = x          = x          = 2^(1/2)     = 0

Figure 33.1-C: Numerical values occurring in the shift-and-add computation of $b^x = 2^{1/2} = \sqrt{2}$. The
values are printed at points where a comment [PRINT] appears in the code.
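As a companion to the listing, here is a minimal stand-alone sketch of the same loop. The name shiftadd_exp_sketch and the table built at run time are illustrative assumptions; the sketch is restricted to $x \ge 0$ and $b > 1$.

```cpp
#include <cmath>
#include <vector>

// Sketch only: returns an approximation of b^x for b > 1 and x >= 0.
// Table entries A_k = log_b(1 + 2^-k) are computed with std::log here.
double shiftadd_exp_sketch(double x, double b, unsigned n)
{
    std::vector<double> ltab(n, 0.0);
    double s = 1.0;
    for (unsigned k = 1; k < n; ++k)
    {
        s *= 0.5;                                   // s == 2^-k
        ltab[k] = std::log(1.0 + s) / std::log(b);  // A_k
    }

    double t = 0.0;  // sum of the table entries used so far, t <= x
    double e = 1.0;  // e == b^t
    double v = 1.0;
    for (unsigned k = 1; k < n; ++k)
    {
        v *= 0.5;                                   // v == 2^-k
        while (true)                                // a step may be repeated
        {
            const double u = t + ltab[k];
            if (!(u <= x))  break;
            t = u;
            e += e * v;                             // e *= (1 + 2^-k)
        }
    }
    return e;
}
```

The invariant $e = b^t$ holds after every accepted step, so the error of the result is governed by how closely t approaches x.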


33.1.3 An alternative algorithm for the logarithm 


A slightly different method for the computation of the base-b logarithm ($b > 0$, $b \neq 1$) is given in [212,
ex.25, sect.1.2.2, p.26]. Here the table used has to contain the values $A_k = \log_b\left(\frac{2^k}{2^k-1}\right)$:

  double *briggs_ltab;
  ulong ltab_len;

  void
  make_briggs_ltab(ulong na, double b)
  {
      double l1b = 1.0 / log(b);
      double s = 2.0;  // == 2^k
      briggs_ltab[0] = -1.0;  // unused
      for (ulong k=1; k<na; ++k)
      {
          briggs_ltab[k] = log(s/(s-1.0)) * l1b;
          s *= 2.0;
      }
  }
The algorithm works for x > 1 and terminates when a given precision (eps) is reached [FXT: arith/briggs-...]:

  double
  briggs_log(double x, double eps)
  {
      double y = 0;
      double z = x * 0.5;
      // [PRINT]
      ulong k = 1;
      double v = 0.5;  // v == 2^(-k)
      while ( fabs(x-1.0)>=eps )
      {
          while ( fabs(x-z)<1.0 )
          {
              z *= 0.5;
              ++k;  v *= 0.5;
              if ( k >= ltab_len )  goto done;  // no more table entries
              // [PRINT1]
          }
          x -= z;
          y += briggs_ltab[k];
          z = x * v;  // z=(x>>k)
          // invariant:  y_k + log_b(x_k) == log_b(x_0)
          // [PRINT2]
      }
   done:
      return y;
  }

The values for the first steps of the computation for the argument $x_0 = \sqrt{2}$ are given in figure 33.1-D.

  k:   x_k        | y_k        | z_k         | A_k
 init | 1.41421356 | 0.00000000 | +0.70710678 | +0.00000000
  2:  1.41421356 | 0.00000000 | +0.35355339 | +0.41503749
  2:  1.06066017 | 0.41503749 | +0.26516504 | +0.41503749
  3:  1.06066017 | 0.41503749 | +0.13258252 | +0.19264507
  4:  1.06066017 | 0.41503749 | +0.06629126 | +0.09310940
  5:  1.06066017 | 0.41503749 | +0.03314563 | +0.04580368
  5:  1.02751454 | 0.46084118 | +0.03210982 | +0.04580368
  6:  1.02751454 | 0.46084118 | +0.01605491 | +0.02272007
  6:  1.01145962 | 0.48356126 | +0.01580405 | +0.02272007
  7:  1.01145962 | 0.48356126 | +0.00790202 | +0.01131531
  7:  1.00355759 | 0.49487657 | +0.00784029 | +0.01131531
  8:  1.00355759 | 0.49487657 | +0.00392014 | +0.00564656
  9:  1.00355759 | 0.49487657 | +0.00196007 | +0.00282051
  9:  1.00159752 | 0.49769709 | +0.00195624 | +0.00282051
 10:  1.00159752 | 0.49769709 | +0.00097812 | +0.00140957
 10:  1.00061940 | 0.49910666 | +0.00097716 | +0.00140957
 11:  1.00061940 | 0.49910666 | +0.00048858 | +0.00070461
 11:  1.00013081 | 0.49981128 | +0.00048834 | +0.00070461
 12:  1.00013081 | 0.49981128 | +0.00024417 | +0.00035226
 13:  1.00013081 | 0.49981128 | +0.00012208 | +0.00017612
 13:  1.00000873 | 0.49998740 | +0.00012207 | +0.00017612
  ∞:  1.00000000 | 0.50000000 | +0.00000000 | +0.00000000
      = 1        | = log_2(√2) | = 0        | = 0

Figure 33.1-D: Numerical values occurring in the computation of $\log_2(\sqrt{2}) = 1/2$. The value of k is
incremented in the inner loop (comment [PRINT1] in the code, the value of z changes). The values of x
and y change just before the location of the comment [PRINT2], corresponding to consecutive rows with
same value of k. The computation of $\log_{1/2}(\sqrt{2}) = -1/2$ corresponds to the same values but opposite
signs for all entries $A_k$ and $y_k$.




33.2 CORDIC algorithms 


The CORDIC algorithms can be used for the computation of functions like sine, cosine, exp and log. 
The acronym CORDIC stands for Coordinate Rotation Digital Computer. Similar to the shift-and-add 
algorithms only multiplications by powers of 2 (shifts), additions, subtractions and comparisons are used. 
Again, a precomputed lookup table with as many entries as the desired accuracy in bits is required. 


33.2.1 The circular case: sine and cosine 


init | 0.60725293 | 0.00000000 | +1.04719755 | +0.00000000 
0: | 0.60725293 | 0.60725293 | +0.26179938 | -0.78539816 
1: | 0.30362646 | 0.91087940 | -0.20184822 | -0.46364760 
2: | 0.53134631 | 0.83497278 | +0.04313044 | +0.24497866 
3: | 0.42697471 | 0.90139107 | -0.08122455 | -0.12435499 
4: | 0.48331166 | 0.87470515 | -0.01880574 | +0.06241880 
5: | 0.51064619 | 0.85960166 | +0.01243409 | +0.03123983 
6: | 0.49721492 | 0.86758051 | -0.00318963 | -0.01562372 
7: | 0.50399289 | 0.86369602 | +0.00462270 | +0.00781234 
8: | 0.50061908 | 0.86566474 | +0.00071647 | -0.00390623 
9: | 0.49892833 | 0.86664251 | -0.00123664 | -0.00195312 
10: | 0.49977466 | 0.86615528 | -0.00026008 | +0.00097656 
11: | 0.50019758 | 0.86591124 | +0.00022819 | +0.00048828 
12: | 0.49998618 | 0.86603336 | -0.00001594 | -0.00024414 
13: | 0.50009190 | 0.86597233 | +0.00010612 | +0.00012207 
14: | 0.50003904 | 0.86600285 | +0.00004508 | -0.00006103 
15: | 0.50001261 | 0.86601811 | +0.00001457 | -0.00003051 
  ∞: | 0.50000000 | 0.86602540 | +0.00000000 | +0.00000000
       = cos(π/3)   = sin(π/3)   = 0           = 0


Figure 33.2-A: Numerical values occurring in the CORDIC computation of $\cos(\pi/3)$ and $\sin(\pi/3)$.


We start with a CORDIC routine for the computation of the sine and cosine. The lookup table has to
contain the values $\arctan(2^{-k})$ for k = 0, 1, 2, 3, ..., these are stored in the array cordic_ctab[]. An
implementation of the function is given in [FXT: arith/cordic-circ-demo.cc]:

  void
  cordic_circ(double theta, double &s, double &c, ulong n)
  {
      double x = cordic_1K;
      double y = 0;
      double z = theta;
      double v = 1.0;
      // [PRINT]
      for (ulong k=0; k<n; ++k)
      {
          double d = ( z>=0 ? +1 : -1 );
          double tx = x - d * v * y;
          double ty = y + d * v * x;
          double tz = z - d * cordic_ctab[k];
          x = tx;  y = ty;  z = tz;
          v *= 0.5;
          // [PRINT]
      }
      c = x;
      s = y;
  }


For the sake of clarity floating-point types are used. All operations can easily be converted to integer 
arithmetic. The multiplications by d are sign changes and should be replaced by an if-construct. The 
multiplications by v are shifts. 
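The conversion to integer arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the FXT code: the name cordic_circ_fixed, the choice of 30 fractional bits, and the table values computed at run time with std::atan are all hypothetical. It also assumes that right-shifting a negative signed integer is an arithmetic shift (guaranteed only from C++20 on, but the usual behavior).

```cpp
#include <cmath>
#include <cstdint>

// Sketch only: fixed-point CORDIC for sine and cosine, 30 fractional bits.
// Multiplications by d become an if-construct, multiplications by v shifts.
void cordic_circ_fixed(double theta, double &s, double &c, unsigned n = 30)
{
    const int frac = 30;
    const double one = double(std::int64_t(1) << frac);

    double K = 1.0;  // scaling constant for n steps
    for (unsigned k = 0; k < n; ++k)
        K *= std::sqrt(1.0 + std::ldexp(1.0, -2 * int(k)));

    std::int64_t x = std::llround(one / K);
    std::int64_t y = 0;
    std::int64_t z = std::llround(theta * one);
    for (unsigned k = 0; k < n; ++k)
    {
        // table entry arctan(2^-k), scaled; precomputed in a real implementation
        const std::int64_t a = std::llround(std::atan(std::ldexp(1.0, -int(k))) * one);
        std::int64_t tx, ty;
        if (z >= 0) { tx = x - (y >> k);  ty = y + (x >> k);  z -= a; }
        else        { tx = x + (y >> k);  ty = y - (x >> k);  z += a; }
        x = tx;  y = ty;
    }
    c = double(x) / one;
    s = double(y) / one;
}
```

Each shift truncates, so a few of the lowest bits are lost; with 30 fractional bits one can expect roughly six to seven correct decimal digits.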


The values for the first 16 steps of the computation for the argument $z_0 = \theta = \pi/3 = 1.04719755\ldots$ are
given in figure 33.2-A. While z gets closer to 0 (however, the magnitude of z does not necessarily decrease
with every step) the values of x and y approach $\cos(\pi/3) = 1/2$ and $\sin(\pi/3) = \sqrt{3}/2 = 0.86602540\ldots$,
respectively.


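More formally, one initializes

\begin{align}
x_0 &= 1/K = 0.607252935008881\ldots \tag{33.2-1a}\\
y_0 &= 0 \tag{33.2-1b}\\
z_0 &= \theta \tag{33.2-1c}
\end{align}

and iterates (starting with k = 0)

\begin{align}
A_k &= \arctan(2^{-k}) \quad \text{(precomputed)} \tag{33.2-1d}\\
v_k &= 2^{-k} \tag{33.2-1e}\\
d_k &= \operatorname{sign}(z_k) \tag{33.2-1f}\\
x_{k+1} &= x_k - d_k\,v_k\,y_k \;\to\; \cos(\theta) \tag{33.2-1g}\\
y_{k+1} &= y_k + d_k\,v_k\,x_k \;\to\; \sin(\theta) \tag{33.2-1h}\\
z_{k+1} &= z_k - d_k\,A_k \;\to\; 0 \tag{33.2-1i}
\end{align}

The scaling constant K is

\begin{align}
K &= \prod_{k=0}^{\infty} \sqrt{1 + 2^{-2k}} \tag{33.2-2a}\\
K &= 1.646760258121065648366051222282298435652376725701027409\ldots \tag{33.2-2b}\\
\frac{1}{K} &= 0.6072529350088812561694467525049282631123908521500897724\ldots \tag{33.2-2c}
\end{align}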


We note that K can be computed more efficiently as $K = \sqrt{2\,F(1/4)}$ where F(z) is defined as

$$F(z) = \prod_{k=1}^{\infty} \left(1 + z^k\right) \tag{33.2-3}$$

We use relation 16.4-23 and relation 16.4-15a: $F(z) = P(z^2)/P(z)$ where

$$P(z) = 1 + \sum_{k=1}^{\infty} (-1)^k \left(z^{k(3k-1)/2} + z^{k(3k+1)/2}\right) \tag{33.2-4}$$

Using n terms of the sum gives a precision of about $3\,(n-1)^2$ bits:

  ? pent(z, n)=1+sum(k=1,n, (-1)^k*(z^(k*(3*k-1)/2) + z^(k*(3*k+1)/2)));
  ? n=30; u=0.25; K=sqrt( 2 * pent(u^2,n)/pent(u,n) )
  1.646760258121065648366051222282298435652376725701027409


The CORDIC algorithm converges if $-r \le z_0 \le r$ where

\begin{align}
r &= \sum_{k=0}^{\infty} \arctan(2^{-k}) \tag{33.2-5a}\\
r &= 1.743286620472340003504337656136416285813831185428206523\ldots \tag{33.2-5b}\\
r &> \frac{\pi}{2} = 1.57079632\ldots \tag{33.2-5c}
\end{align}

With arguments $x_0$, $y_0$, $z_0$ one has

\begin{align}
x &\to K\,(x_0 \cos(z_0) - y_0 \sin(z_0)) \tag{33.2-6a}\\
y &\to K\,(y_0 \cos(z_0) + x_0 \sin(z_0)) \tag{33.2-6b}\\
z &\to 0 \tag{33.2-6c}
\end{align}

which for $x_0 = 1/K$, $y_0 = 0$, $z_0 = \theta$ specializes to the computation as above.


A nice feature of the algorithm is that it also works backwards: initialize as above and use the same
iteration with the slight modification that $d_k := -\operatorname{sign}(y_k)$, then

\begin{align}
x &\to K \sqrt{x_0^2 + y_0^2} \tag{33.2-7a}\\
y &\to 0 \tag{33.2-7b}\\
z &\to z_0 + \arctan\left(\frac{y_0}{x_0}\right) \tag{33.2-7c}
\end{align}
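The backward direction can be checked numerically with a small floating-point sketch. The function name cordic_vec_sketch is hypothetical, the table entries are computed with std::atan on the fly, and the scaling by K is removed at the end instead of being folded into the start value. It assumes $x_0 > 0$ and an angle within the range of convergence.

```cpp
#include <cmath>

// Sketch only: circular CORDIC in the backward direction with z_0 = 0.
// On return r approximates sqrt(x0^2 + y0^2) and phi approximates arctan(y0/x0).
void cordic_vec_sketch(double x0, double y0, double &r, double &phi, unsigned n = 50)
{
    double x = x0, y = y0, z = 0.0;
    double v = 1.0;  // v == 2^-k
    double K = 1.0;  // accumulated scaling constant
    for (unsigned k = 0; k < n; ++k)
    {
        const double d = (y >= 0 ? -1.0 : +1.0);  // d_k = -sign(y_k)
        const double tx = x - d * v * y;
        const double ty = y + d * v * x;
        z -= d * std::atan(v);                    // A_k = arctan(2^-k)
        x = tx;  y = ty;
        K *= std::sqrt(1.0 + v * v);
        v *= 0.5;
    }
    r = x / K;
    phi = z;
}
```

For the input (3, 4) the sketch should return values close to 5 and arctan(4/3), which illustrates that z accumulates the angle while y is driven to zero.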


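The algorithm can be derived by writing

$$\begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} =
\begin{bmatrix} +\cos(d_k A_k) & -\sin(d_k A_k) \\ +\sin(d_k A_k) & +\cos(d_k A_k) \end{bmatrix}
\begin{bmatrix} x_k \\ y_k \end{bmatrix} \tag{33.2-8}$$

and noting that (using $d_k = \pm 1$, so $\cos(d_k A_k) = \cos(A_k)$ and $\sin(d_k A_k) = d_k \sin(A_k)$)

$$\begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} = \cos(A_k)
\begin{bmatrix} +1 & -d_k v_k \\ +d_k v_k & +1 \end{bmatrix}
\begin{bmatrix} x_k \\ y_k \end{bmatrix} \tag{33.2-9}$$

where $v_k = 2^{-k}$. The CORDIC algorithm postpones the multiplications by $\cos(A_k)$. We have

$$\cos(A_k) = \cos\left(\arctan(2^{-k})\right) = \frac{1}{\sqrt{1 + 2^{-2k}}} \tag{33.2-10}$$

and

$$K = 1 \Big/ \prod_{k=0}^{\infty} \cos(A_k) = \prod_{k=0}^{\infty} \sqrt{1 + 2^{-2k}} \tag{33.2-11}$$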


33.2.2 The linear case: multiplication and division 


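A slight variation gives a base-2 multiply-add algorithm:

\begin{align}
A_k &= 2^{-k} \tag{33.2-12a}\\
v_k &= 2^{-k} \tag{33.2-12b}\\
d_k &= \operatorname{sign}(z_k) \tag{33.2-12c}\\
x_{k+1} &= x_k \tag{33.2-12d}\\
y_{k+1} &= y_k + d_k\,v_k\,x_k \tag{33.2-12e}\\
z_{k+1} &= z_k - d_k\,A_k \tag{33.2-12f}
\end{align}

We have

\begin{align}
x &\to x_0 \tag{33.2-13a}\\
y &\to y_0 + x_0\,z_0 \tag{33.2-13b}\\
z &\to 0 \tag{33.2-13c}
\end{align}

Going backwards (replace relation 33.2-12c by $d_k := -\operatorname{sign}(y_k)$) gives an algorithm for division:

\begin{align}
x &\to x_0 \tag{33.2-14a}\\
y &\to 0 \tag{33.2-14b}\\
z &\to z_0 + \frac{y_0}{x_0} \tag{33.2-14c}
\end{align}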
  k:   x_k        | y_k        | z_k         | A_k
init | 1.20749706 | 0.00000000 | +1.00000000 | +0.00000000 
1: 1.20749706 | 0.60374853 | +0.45069385 | -0.54930614 
2: 1.35843420 | 0.90562280 | +0.19528104 | -0.25541281 
3: 1.47163705 | 1.07542707 | +0.06962382 | -0.12565721 
4: 1.53885124 | 1.16740439 | +0.00704225 | -0.06258157 
+4: 1.61181401 | 1.26358259 | -0.05553931 | -0.06258157 
5: 1.57232706 | 1.21321340 | -0.02427913 | +0.03126017 
6: 1.55337060 | 1.18864579 | -0.00865286 | +0.01562627 
7: 1.54408430 | 1.17651008 | -0.00084020 | +0.00781265
8: 1.53948856 | 1.17047850 | +0.00306606 | +0.00390626 
9: 1.54177465 | 1.17348532 | +0.00111293 | -0.00195312 
10: 1.54292063 | 1.17499096 | +0.00013637 | -0.00097656 
11: 1.54349436 | 1.17574434 | -0.00035190 | -0.00048828 
12: 1.54320731 | 1.17536751 | -0.00010776 | +0.00024414 
13: 1.54306383 | 1.17517913 | +0.00001430 | +0.00012207 
+13: 1.54320729 | 1.17536749 | -0.00010776 | -0.00012207 
14: 1.54313555 | 1.17527330 | -0.00004673 | +0.00006103 
15: 1.54309968 | 1.17522621 | -0.00001621 | +0.00003051 
 ∞: 1.54308063 | 1.17520119 | +0.00000000 | +0.00000000
     = cosh(1)    = sinh(1)    = 0           = 0


Figure 33.2-B: Numerical values occurring in the CORDIC computation of cosh(1) and sinh(1). Note
that steps 4 and 13 are executed twice.


33.2.3 The hyperbolic case: sinh and cosh 


The versions presented so far can be unified as

\begin{align}
v_k &= 2^{-k} \tag{33.2-15a}\\
x_{k+1} &= x_k - m\,d_k\,v_k\,y_k \tag{33.2-15b}\\
y_{k+1} &= y_k + d_k\,v_k\,x_k \tag{33.2-15c}\\
z_{k+1} &= z_k - d_k\,A_k \tag{33.2-15d}
\end{align}

where the linear case corresponds to m = 0 and $A_k = 2^{-k}$, the circular case to m = 1 and $A_k =
\arctan(2^{-k})$. The forward direction ('rotation mode') is obtained by setting $d_k = \operatorname{sign}(z_k)$, the backward
direction ('vectoring mode') by setting $d_k = -\operatorname{sign}(y_k)$.

Setting m = −1 gives a CORDIC algorithm for the computation of the hyperbolic sine and cosine or their
inverses. The lookup table has to contain the values $\operatorname{arctanh}(2^{-k})$ for k = 1, 2, 3, ..., stored in the array
cordic_htab[]. The algorithm needs a modification: the iteration starts with index one and some steps
have to be executed twice. The sequence of the indices that need to be processed twice is 4, 13, 40, 121, ...
($i_0 = 4$, $i_{n+1} = 3\,i_n + 1$, entry A003462 in [812]).
A sample implementation is given in [FXT: arith/cordic-hyp-demo.cc]:

  void
  cordic_hyp(double theta, double &s, double &c, ulong n)
  {
      double x = cordic_1Kp;
      double y = 0;
      double z = theta;
      double v = 1.0;
      // [PRINT]
      ulong i = 4;
      for (ulong k=1; k<n; ++k)
      {
          v *= 0.5;
      again:
          double d = ( z>=0 ? +1 : -1 );
          double tx = x + d * v * y;
          double ty = y + d * v * x;
          double tz = z - d * cordic_htab[k];
          x = tx;  y = ty;  z = tz;
          // [PRINT]
          if ( k==i )  { i = 3*i+1;  goto again; }  // repeat steps 4, 13, 40, ...
      }
      c = x;
      s = y;
  }


The values for the first steps of the computation for the argument $\theta = z_1 = 1.0$ are given in figure 33.2-B.
The scaling constant K' can be computed as

\begin{align}
K' &= \prod_{k=1}^{\infty} \sqrt{1 - 2^{-2k}} \,\cdot\, \prod_{k=0}^{\infty} \sqrt{1 - 2^{-2 i_k}} \tag{33.2-16a}\\
K' &= 0.8281593609602156270761983277591751468694538376908425291\ldots \tag{33.2-16b}\\
\frac{1}{K'} &= 1.207497067763072128877721011310915836812783221769813422\ldots \tag{33.2-16c}
\end{align}

The duplicated indices appear twice in the product. The algorithm can be used for the computation of
the exponential function using $\exp(x) = \sinh(x) + \cosh(x)$. The algorithm converges if $-r' \le z_1 \le r'$
where

\begin{align}
r' &= \sum_{k=1}^{\infty} \operatorname{arctanh}(2^{-k}) + \sum_{k=0}^{\infty} \operatorname{arctanh}(2^{-i_k}) \tag{33.2-17a}\\
r' &= 1.118173015526503803610627556783092451806572942929536106\ldots \tag{33.2-17b}
\end{align}


With arguments $x_1$, $y_1$, $z_1$ we have

\begin{align}
x &\to K'\,(x_1 \cosh(z_1) + y_1 \sinh(z_1)) \tag{33.2-18a}\\
y &\to K'\,(y_1 \cosh(z_1) + x_1 \sinh(z_1)) \tag{33.2-18b}\\
z &\to 0 \tag{33.2-18c}
\end{align}

The backward version ($d_k := -\operatorname{sign}(y_k)$) computes

\begin{align}
x &\to K' \sqrt{x_1^2 - y_1^2} \tag{33.2-19a}\\
y &\to 0 \tag{33.2-19b}\\
z &\to z_1 + \operatorname{arctanh}\left(\frac{y_1}{x_1}\right) \tag{33.2-19c}
\end{align}

For the computation of the natural logarithm use $\log(w) = 2 \operatorname{arctanh}\frac{w-1}{w+1}$. That is, start with $x_1 = w + 1$
and $y_1 = w - 1$, then $z \to \frac{1}{2} \log(w)$.

The square root $\sqrt{w}$ can be computed by starting with $x_1 = w + 1/4$ and $y_1 = w - 1/4$, then $x \to K' \sqrt{w}$.

For further information see [14], [179], and [256, chap.6]. An algorithm working with complex numbers
is given in [36].
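The hyperbolic iteration, including the repeated steps, can be exercised with the following floating-point sketch. The name cordic_hyp_sketch is hypothetical; the table entries are computed with std::atanh on the fly and the scaling by K' is removed at the end rather than folded into the start value. It assumes $|\theta| \le r'$.

```cpp
#include <cmath>

// Sketch only: hyperbolic CORDIC (m = -1), forward direction.
// On return c approximates cosh(theta), s approximates sinh(theta).
void cordic_hyp_sketch(double theta, double &s, double &c, unsigned n = 50)
{
    double x = 1.0, y = 0.0, z = theta;
    double v = 1.0;
    double Kp = 1.0;      // accumulated scaling constant K'
    unsigned i = 4;       // next index to be processed twice
    for (unsigned k = 1; k < n; ++k)
    {
        v *= 0.5;                               // v == 2^-k
        const double a = std::atanh(v);         // A_k = arctanh(2^-k)
        const unsigned reps = (k == i ? 2 : 1);
        if (k == i)  i = 3 * i + 1;             // 4, 13, 40, 121, ...
        for (unsigned r = 0; r < reps; ++r)
        {
            const double d = (z >= 0 ? +1.0 : -1.0);
            const double tx = x + d * v * y;
            const double ty = y + d * v * x;
            z -= d * a;
            x = tx;  y = ty;
            Kp *= std::sqrt(1.0 - v * v);       // repeated indices enter twice
        }
    }
    c = x / Kp;
    s = y / Kp;
}
```

The sum c + s then approximates exp(theta), in line with the relation exp(x) = sinh(x) + cosh(x) used above.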
Chapter 34


Numerical evaluation of power series 


We give algorithms for the numerical evaluation of power series. If the series coefficients are rational, the 
binary splitting (binsplit) algorithm can be applied for rational arguments and the rectangular schemes 
for real (full-precision) arguments. As a special case of the binary splitting algorithm, a method for fast 
radix conversion is described. Finally we describe a technique for the summation of series with alternating 
coefficients. 


34.1 The binary splitting algorithm for rational series 


The straightforward computation of a series for which each term adds a constant amount of precision (for
example, the arc-cotangent series with arguments > 1) to a precision of N digits involves the summation
of O(N) terms. Each term involves one (length-N) short division (and one addition). Therefore the total
work is $O(N^2)$, which makes it impossible to compute billions of digits from linearly convergent series
even if they are as 'good' as Chudnovsky's famous series for $\pi$ (given in [102]):

\begin{align}
\frac{1}{\pi} &= \frac{6541681608}{\sqrt{640320^3}} \sum_{k=0}^{\infty} \left(\frac{13591409}{545140134} + k\right) \frac{(6k)!}{(k!)^3\,(3k)!}\, \frac{(-1)^k}{640320^{3k}} \tag{34.1-1a}\\
&= \frac{12}{\sqrt{640320^3}} \sum_{k=0}^{\infty} (-1)^k\, \frac{(6k)!}{(k!)^3\,(3k)!}\, \frac{13591409 + 545140134\,k}{(640320)^{3k}} \tag{34.1-1b}
\end{align}

34.1.1 Binary splitting scheme for products 


34.1.1.1 Computation of the factorial 


We motivate the binsplit algorithm by giving the analogue for the fast computation of the factorial.
Define $f_{m,n} := m \cdot (m+1) \cdot (m+2) \cdots (n-1) \cdot n$, then $n! = f_{1,n}$. We compute n! by recursively using
the relation $f_{m,n} = f_{m,x} \cdot f_{x+1,n}$ where $x = \lfloor (m+n)/2 \rfloor$:

  indent(i)=for(k=1,8*i,print1(" "));  \\ aux: print 8*i spaces

  F(m, n, i=0)=
  { /* Factorial, self-documenting */
      local(x, ret);
      indent(i);  print( "F(", m, ", ", n, ")");
      if ( m==n,  /* then: */
          ret = m;  \\ == F(m,m)
      ,  /* else: */
          x = floor( (m+n)/2 );
          ret = F(m, x, i+1) * F(x+1, n, i+1);
      );
      indent(i);  print( "^== ", ret);
      return( ret );
  }


The function prints the intermediate values occurring in the computation. The additional parameter i
keeps track of the calling depth, used with the auxiliary function indent(). Figure 34.1-A shows the
output with the computation of 8! = F(1,8). A fragment like

  F(5, 6)
          F(5, 5)
          ^== 5
          F(6, 6)
          ^== 6
  ^== 30

says "F(5,6) called F(5,5) [which returned 5], then called F(6,6) [which returned 6]. Then F(5,6)
returned 30." For the computation of other products modify the line ret=m; as indicated in the code.


Note that we compute the product in a depth-first fashion to obtain a localized memory access. An 
implementation of the scheme by computing products of pairs, pairs of pairs, etc., gives the identical 
result but is likely to suffer from cache problems. 
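The depth-first recursion translates directly to C++. The following sketch uses 64-bit machine integers, so it only illustrates the splitting scheme (the result overflows beyond 20!); the speed advantage appears only with multi-precision numbers, where the two factors of each product have similar size. The function name binsplit_product is an assumption made for illustration.

```cpp
#include <cstdint>

// Sketch only: product m * (m+1) * ... * n by binary splitting, depth first.
std::uint64_t binsplit_product(std::uint64_t m, std::uint64_t n)
{
    if (m == n)  return m;                    // == f_{m,m}
    const std::uint64_t x = (m + n) / 2;      // split point
    return binsplit_product(m, x) * binsplit_product(x + 1, n);
}
```

For example, binsplit_product(1, 8) evaluates 8! via the same tree of calls as F(1,8) above.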


34.1.1.2 Computation of a polynomial from its roots 


Given the n roots $a_i$ of a polynomial $C = \sum_j c_j\,x^j$ we can compute C by a trivial modification of the
routine above:

  F(m, n, i=0)=
  { /* Polynomial by roots, self-documenting */
  [--snip--]
  \\        ret = m;  \\ == F(m,m)
          ret = 'x - m;  \\ == F(m,m) := (x - a_i) where a_i is the i-th root
  [--snip--]
  }

Here we choose the roots to be $a_i = i$. The quantities with the computation of $C = \prod_{i=1}^{8} (x - i)$ are
shown in figure 34.1-B. The coefficients of this particular polynomial are the (signed) Stirling numbers of
the first kind, see figure 11.1-A on page 277.

34.1.2 Binary splitting scheme for sums

For the evaluation of a sum $\sum_{k=0}^{N-1} a_k$ we use the ratios $R_k$ of consecutive terms:

$$R_k := \frac{a_k}{a_{k-1}} \tag{34.1-2}$$

Figure 34.1-B: Computation of the polynomial $\prod_{i=1}^{8} (x - i)$.


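Set $a_{-1} := 1$ to avoid a special case for k = 0. We have

$$\sum_{k=0}^{N-1} a_k = R_0\,(1 + R_1\,(1 + R_2\,(1 + R_3\,(1 + \ldots (1 + R_{N-1}) \ldots )))) \tag{34.1-3}$$

Now define

\begin{align}
R_{m,n} &:= R_m\,(1 + R_{m+1}\,(\ldots (1 + R_n) \ldots )) \quad \text{where } m < n \tag{34.1-4a}\\
R_{m,m} &:= R_m \tag{34.1-4b}
\end{align}

Then we have

$$R_{m,n} = \frac{1}{a_{m-1}} \sum_{k=m}^{n} a_k \tag{34.1-5}$$

and especially

$$R_{0,n} = \sum_{k=0}^{n} a_k \tag{34.1-6}$$

We have

\begin{align}
R_{m,n} &= R_m + R_m R_{m+1} + R_m R_{m+1} R_{m+2} + \ldots \tag{34.1-7a}\\
&\quad \ldots + R_m \cdots R_x + R_m \cdots R_x \cdot \left[ R_{x+1} + \ldots + R_{x+1} \cdots R_n \right] \nonumber\\
&= R_{m,x} + \prod_{k=m}^{x} R_k \cdot R_{x+1,n} \tag{34.1-7b}
\end{align}

The product telescopes, one gets (for $m \le x < n$)

$$R_{m,n} = R_{m,x} + \frac{a_x}{a_{m-1}}\, R_{x+1,n} \tag{34.1-8}$$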
Figure 34.1-C: Quantities with the binsplit computation of $\sum_{k=0}^{6} 2^{-(k+1)} = 127/128$.


34.1.3 Implementation using rationals 


Now we can formulate the binary splitting algorithm by giving a binsplit function using GP:

  R(m, n)=
  { /* Rational binsplit */
      local(x, ret);
      if ( m==n,  /* then: */
          ret = A(m)/A(m-1);
      ,  /* else: */
          x = floor( (m+n)/2 );
          ret = R(m, x) + A(x) / A(m-1) * R(x+1, n);
      );
      return( ret );
  }

Here A(k) must be a function that returns the k-th term of the series we wish to compute, in addition
one must have A(-1)=1. For example, to compute arctan(1/10) one would use

  A(k)=if(k<0, 1, (-1)^(k)/((2*k+1)*10^(2*k+1)));

Figure 34.1-C shows the intermediate values with the computation of $\sum_{k=0}^{6} 2^{-(k+1)}$.


34.1.4 Implementation using integers 


In case the programming language used does not provide rational numbers, rewrite formula 34.1-8 in
separate parts for the denominator and numerator. With $a_i = p_i/q_i$, $p_{-1} = q_{-1} = 1$ and $R_{m,n} =:
U_{m,n} / V_{m,n}$ one gets

\begin{align}
U_{m,n} &= p_{m-1}\,q_x\,U_{m,x}\,V_{x+1,n} + p_x\,q_{m-1}\,U_{x+1,n}\,V_{m,x} \tag{34.1-9a}\\
V_{m,n} &= p_{m-1}\,q_x\,V_{m,x}\,V_{x+1,n} \tag{34.1-9b}
\end{align}

The following implementation also contains code for reduction to lowest terms:

  Q(m, n)=
  { /* Integer binsplit */
      local(x, ret, bm, bx, tm, tx);
      if ( m==n,  /* then: */
          bm = B(m);  bx = B(m-1);
          ret = [ bm[1]*bx[2] , bx[1]*bm[2] ];  \\ == B(m)/B(m-1);
          x = gcd(ret[1], ret[2]);  /* Reduction */
          ret = [ret[1]/x, ret[2]/x];  /* Reduction */
      ,  /* else: */
          x = floor( (m+n)/2 );
          tm = Q(m, x);    \\ [U_{m,x}, V_{m,x}]
          tx = Q(x+1, n);  \\ [U_{x+1,n}, V_{x+1,n}]
          bm = B(m-1);     \\ [p_{m-1}, q_{m-1}]
          bx = B(x);       \\ [p_{x}, q_{x}]
          \\ ret == Q(m, x) + B(x) / B(m-1) * Q(x+1, n);
          ret = [ bm[1]*bx[2]*tm[1]*tx[2] + bx[1]*bm[2]*tx[1]*tm[2],
                  bm[1]*bx[2]*tm[2]*tx[2] ];
          x = gcd(ret[1], ret[2]);  /* Reduction */
          ret = [ret[1]/x, ret[2]/x];  /* Reduction */
      );
      return( ret );
  }

  Q(0, 6)
      Q(0, 3)
          Q(0, 1)
              Q(0, 0)
              ^== [1, 10]
              Q(1, 1)
              ^== [-10, 3000]
          ^== [29900, 300000]
          Q(2, 3)
              Q(2, 2)
              ^== [3000, -500000]
              Q(3, 3)
              ^== [-500000, 70000000]
          ^== [-104250000000000000, 17500000000000000000]
      ^== [1569781275000000000000000000, 15750000000000000000000000000]
      [--snip--]

Figure 34.1-D: Explosive growth of intermediate quantities with computing arctan(1/10).

Whether the reduction should be used depends on the terms of the sum. If arctan(1/10) is computed
without the reduction, the intermediate quantities grow exponentially, as shown in figure 34.1-D. The
square brackets are the quantities $[U_{m,n}, V_{m,n}]$. Such explosive growth will occur with all power series
unless the function argument is 1.
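The integer recursion with reduction can be reproduced with machine integers for a small sum. The sketch below is an illustration only: the names Frac, term, reduce_frac and binsplit_sum are hypothetical, the term function is fixed to $a_k = 2^{-(k+1)}$, and 64-bit integers overflow quickly for anything but toy examples (a real implementation needs multi-precision integers).

```cpp
#include <cstdint>
#include <utility>

using Frac = std::pair<std::int64_t, std::int64_t>;  // [numerator, denominator]

// Illustrative term function: a_k = 1/2^(k+1), and a_{-1} = 1.
Frac term(long k)
{
    if (k < 0)  return Frac(1, 1);
    return Frac(1, std::int64_t(1) << (k + 1));
}

// Reduction to lowest terms (Euclidean algorithm on the absolute values).
Frac reduce_frac(Frac f)
{
    std::int64_t a = (f.first < 0 ? -f.first : f.first);
    std::int64_t b = (f.second < 0 ? -f.second : f.second);
    while (b != 0) { const std::int64_t t = a % b;  a = b;  b = t; }
    if (a > 1) { f.first /= a;  f.second /= a; }
    return f;
}

// Returns [U_{m,n}, V_{m,n}] with U/V == (a_m + ... + a_n) / a_{m-1}.
Frac binsplit_sum(long m, long n)
{
    if (m == n)
    {
        const Frac bm = term(m), bx = term(m - 1);
        return reduce_frac(Frac(bm.first * bx.second, bx.first * bm.second));
    }
    const long x = (m + n) / 2;
    const Frac tm = binsplit_sum(m, x);      // [U_{m,x}, V_{m,x}]
    const Frac tx = binsplit_sum(x + 1, n);  // [U_{x+1,n}, V_{x+1,n}]
    const Frac bm = term(m - 1);             // [p_{m-1}, q_{m-1}]
    const Frac bx = term(x);                 // [p_x, q_x]
    return reduce_frac(Frac(
        bm.first * bx.second * tm.first * tx.second
          + bx.first * bm.second * tx.first * tm.second,
        bm.first * bx.second * tm.second * tx.second));
}
```

Calling binsplit_sum(0, 6) should give the pair [127, 128], the value stated for the sum of $2^{-(k+1)}$ over k = 0, ..., 6.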


34.1.5 Performance 


We compute the sum for arctan(1/10) up to the 5,000th term with the direct method, the rational binsplit
and the integer binsplit with and without reduction. The timings for the computation are:

  A(k)=if(k<0,1,(-1)^(k)/((2*k+1)*10^(2*k+1)));  \\ for rational binsplit
  B(k)=if(k<0,[1,1],[(-1)^(k), ((2*k+1)*10^(2*k+1))] );  \\ for integer binsplit
  N=5000;
  sum(k=0,N,A(k));  \\ direct method: 69,385 ms.
  R(0,N);  \\ rational binsplit: 2,532 ms.
  Q(0,N);  \\ integer binsplit with gcd reduction: 4,152 ms.
  Q(0,N);  \\ integer binsplit without gcd reduction: >8min, "forever"

Things look quite different when computing the sum $\sum_{k=0}^{N} (-1)^k/(2k+1)^2$. The intermediate quantities
U and V have only small common factors, so it is better to omit the reduction step:

  B(k)=if(k<0, [1,1], [(-1)^k, (2*k+1)^2] );
  A(k)=if(k<0,1,(-1)^(k)/(2*k+1)^2);
  N=50000;
  sum(k=0,N,A(k));  \\ direct method: 32,396 ms.
  R(0,N);  \\ rational binsplit: 6,826 ms.
  Q(0,N);  \\ integer binsplit with gcd reduction: 27,485 ms.
  Q(0,N);  \\ integer binsplit without gcd reduction: 6,251 ms.


Built-in routines for binsplit summation would likely be faster than these figures suggest. 


The reason why summation via binary splitting is better than the straightforward way is that its
complexity is only $O(\log N \cdot M(N))$, where M(N) is the complexity of one N-bit multiplication (see [175]). If
an FFT-based multiplication algorithm is used ($M(N) \sim N \cdot \log N$), the work is $\approx O((\log N)^2\, N)$. This
means that sums of linear but sufficient convergence are again candidates for high precision computations.
The algorithm should be implemented in the 'depth first' manner presented, and not via the naive pairs,
pairs of pairs, etc. (breadth first) way. The reasons are better locality and less memory consumption.
The naive way needs the most memory after the first pass, when pairs have been multiplied.


34.1.6 Extending prior computations 


To evaluate the sum to a higher precision, reuse $R_{0,N-1}$, the sum of the first N terms. For example, the
sum of the first 2N terms can be computed as

$$R_{0,2N-1} = R_{0,N-1} + a_{N-1} \cdot R_{N,2N-1} \tag{34.1-10}$$

This is formula 34.1-8 with m = 0, x = N − 1, and n = 2N − 1. The same relation with explicit rational
arithmetic is

\begin{align}
U_{0,2N-1} &= q_{N-1}\,U_{0,N-1}\,V_{N,2N-1} + p_{N-1}\,U_{N,2N-1}\,V_{0,N-1} \tag{34.1-11a}\\
V_{0,2N-1} &= q_{N-1}\,V_{0,N-1}\,V_{N,2N-1} \tag{34.1-11b}
\end{align}

With the appearance of some new computer that can multiply two length-2N numbers (assuming the old
model could multiply length-N numbers) we only need to combine the ratios $R_{0,N-1}$ and $R_{N,2N-1}$ that
had been precomputed by the last generation of computers. This costs only a few full-size multiplications,
so we can improve on prior computations cheaply.


34.1.7 Computation of $\pi$: binary splitting versus AGM-type iterations


The binary splitting scheme for the computation of $\pi$ (for example, with the series 34.1-1a on page 651)
can outperform the AGM-based iterations given in section 31.5 on page 615.


This is due to the more favorable memory access pattern with binary splitting. When computing N digits
of $\pi$ the AGM iterations compute $O(\log_2 N)$ roots (and/or inverses) to full precision. At the last phase
of each root computation full-length multiplications that access all of the memory have to be performed.
In contrast, the binary splitting involves full-precision multiplications only at the very last phase.


The drawback of the binary splitting scheme is that it may need significantly more memory than two full
words. This may happen if the numerator and denominator grow fast, which is more likely if no series
as favorable as relation 34.1-1a can be used for the quantity to be computed. The problem can be mitigated
by computing the floating-point value whenever the integer values become too large (as pointed out by
Richard Kreckel [priv. comm.], see [219] and [100]). This technique is used in the CLN library [174].


34.1.8 Fast radix conversion 


A binary splitting scheme for radix conversion of a radix-z integer $[a_N\,a_{N-1} \ldots a_2\,a_1\,a_0]_z$ into the radix
used is obtained via recursive application of the scheme

$$\sum_{k=M}^{N} a_k\,z^k = \sum_{k=M}^{M+X-1} a_k\,z^k + z^X \sum_{k=M+X}^{N} a_k\,z^{k-X} \tag{34.1-12}$$

where X is chosen to be the largest power of 2 that is less than $d := N - M$.


In the following we pretend that the computations with GP are done in decimal. While this is not true 
(the radix used internally is binary and the numbers are converted to decimal only with printing), nothing 
in the output would appear different with decimal calculations. 


We define an auxiliary function that computes (for d > 1) the largest exponent s so that $2^s < d$:

  ex2le(d)=
  { /* return largest s so that 2^s < d */
      local(s, t);
      t=1; s=0;
      while ( d>t, t<<=1; s+=1; );
      t >>= 1; s--;
      return(s);
  }

  R(0, 15)
      R(0, 7)
          R(0, 3)
              R(0, 1)
              ^== 67
              R(2, 3)
              ^== 101
          ^== 25923
          R(4, 7)
              R(4, 5)
              ^== 7
              R(6, 7)
              ^== 33
          ^== 8455
      ^== 554132803
      R(8, 15)
          R(8, 11)
              R(8, 9)
              ^== 67
              R(10, 11)
              ^== 101
          ^== 25923
          R(12, 15)
              R(12, 13)
              ^== 7
              R(14, 15)
              ^== 33
          ^== 8455
      ^== 554132803
  ^== 2379982267079943491

Figure 34.1-E: Intermediate results when converting the number $2107654321076543_{16}$ to decimal.

We precompute $z^2, z^4, z^8, \ldots, z^{2^w}$ where $2^w < N$:

  z=16;  \\ radix
  N=15;  \\ number of digits in radix z
  vz=vector(ceil(log(N)/log(2)));
  vz[1]=z;  for (k=2, length(vz), vz[k]=vz[k-1]^2);  \\ O(N) space


Now the conversion function can be defined as 


  R1(m, n, i=0)=
  { /* Radix conversion, self-documenting */
      local(x, d, ret, t);
      indent(i);  print( "R(", m, ", ", n, ")");
      d = n-m;
      if ( d <= 1,  /* then: */
          if ( d==0, ret = A(m); , ret = A(m) + z*A(n); );
      ,  /* else: */
          t = ex2le(d);
          x = 1<<t;
          ret = R1(m, m+x-1, i+1) + vz[t+1] * R1(m+x, n, i+1);
      );
      indent(i);  print( "^== ", ret);
      return( ret );
  }

We convert the 16-digit, radix-16 number

$$A = 2107654321076543_{16} = [a_{15}\,a_{14} \ldots a_2\,a_1\,a_0]_{16} \tag{34.1-13}$$

The intermediate results are shown in figure 34.1-E. The k-th digit of A is $a_k = (k + 3) \bmod 8$, it is
supplied as the function A(k)=(k+3)%8.
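The conversion can be cross-checked with a small C++ sketch for numbers that fit into 64 bits. The function name radix_binsplit is hypothetical, and the power $z^X$ is obtained by repeated squaring inside the function rather than from a precomputed table, which is a simplification for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: returns sum_{k=m}^{n} a[k] * z^(k-m), digits least significant first.
std::uint64_t radix_binsplit(const std::vector<unsigned> &a, std::uint64_t z,
                             std::size_t m, std::size_t n)
{
    const std::size_t d = n - m;
    if (d == 0)  return a[m];
    if (d == 1)  return a[m] + z * a[n];

    std::size_t x = 1;        // x will be the largest power of 2 less than d
    std::uint64_t zx = z;     // zx == z^x
    while (2 * x < d) { x *= 2;  zx *= zx; }

    return radix_binsplit(a, z, m, m + x - 1) + zx * radix_binsplit(a, z, m + x, n);
}
```

With the digits $a_k = (k+3) \bmod 8$ for k = 0, ..., 15 and z = 16 the result should be 2379982267079943491, the value at the root of the call tree in the figure.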




34.2 Rectangular schemes for evaluation of power series 


The rectangular scheme for the evaluation of polynomials was given in [267] and later in [314]. We use it
for the evaluation of truncated power series up to a given power N − 1 of the series variable. We look at
two variants, one for series whose coefficients are small rationals (as for the logarithm) and another for
series where the ratios of successive coefficients are small rationals (as for the exponential function). If
the numbers of rows and columns in the schemes are identical, a method involving $O(\sqrt{N})$ full-precision
multiplications is obtained. The schemes are very competitive up to very high precision in practice, even
compared with AGM-based methods.


34.2.1 Rectangular scheme for arctan and logarithm


Computing the sum of the first N terms of a power series as

$$S_N = \sum_{k=0}^{N-1} A_k\,z^k = A_0 + z\,(A_1 + z\,(A_2 + z\,(A_3 + \ldots z\,(A_{N-1}) \ldots ))) \tag{34.2-1}$$

costs N long (full-precision) multiplications if z is a full-precision number. If the $A_k$ are small rational
values and $N = R \cdot C$, then we can rewrite $S_N$ as

\begin{align}
S_N = {} & A_{0C} + A_{0C+1}\,z + A_{0C+2}\,z^2 + \ldots + A_{1C-1}\,z^{C-1} + {} \tag{34.2-2}\\
& + z^C\,[\,A_{1C} + A_{1C+1}\,z + A_{1C+2}\,z^2 + \ldots + A_{2C-1}\,z^{C-1} + {} \nonumber\\
& + z^C\,[\,A_{2C} + A_{2C+1}\,z + A_{2C+2}\,z^2 + \ldots + A_{3C-1}\,z^{C-1} + {} \nonumber\\
& + z^C\,[\,A_{3C} + A_{3C+1}\,z + A_{3C+2}\,z^2 + \ldots + A_{4C-1}\,z^{C-1} + {} \nonumber\\
& + \ldots + {} \nonumber\\
& + z^C\,[\,A_{(R-1)C} + A_{(R-1)C+1}\,z + A_{(R-1)C+2}\,z^2 + \ldots + A_{RC-1}\,z^{C-1}\,] \ldots ]]] \nonumber
\end{align}

We compute $S_N$ as

$$[[ \ldots [[U_{R-1}]\,z^C + \ldots + U_3]\,z^C + U_2]\,z^C + U_1]\,z^C + U_0 \tag{34.2-3}$$

where $U_r := \sum_{k=0}^{C-1} A_{rC+k}\,z^k$ is the sum in one row of relation 34.2-2.

Precomputing the quantities $z^2, z^3, z^4, \ldots, z^C$ involves C − 1 long multiplications. The sums in each
row of expression 34.2-2 involve only short multiplications with series coefficients $A_k$. The multiplication
by $z^C$ for each but the first row involves further R − 1 long multiplications. The computation uses
C temporaries ($z, z^2, \ldots, z^C$) and O(R + C) long multiplications. Choosing $R = C = \sqrt{N}$ leads to a
complexity of $O(2\sqrt{N})$ long multiplications and also involves $\sqrt{N}$ temporaries. With argument reduction
the complexity can be improved to $O(\sqrt[3]{N})$ multiplications, see [83, p.25].


34.2.1. Implementation for arctan 


We implement the scheme for the arctan in GP: 


fa(n)= \\ inverse of series coefficient
{ /* returns (-1)^n*(2*n+1), the inverse of the coefficient (-1)^n/(2*n+1) */
    local(an);
    an = (2*n+1);
    if ( bitand(n,1), an=-an);
    return( an );
}

atan_rect(z, R, C)=
{ /* compute atan(z) as z*(1-z^2/3+z^4/5-z^6/7+-...+-z^(2*(R*C-1))/(2*R*C-1)) */
    local(s, vz, ur, k);
    vz = vector(C); \\ vz == [z^2,z^4,z^6,...,z^(2*C)]
    vz[1] = z*z; \\ 1 long multiplication (special for arctan)
    for (k=2, C, vz[k]=vz[1]*vz[k-1]); \\ C-1 long multiplications
    k = R*C; \\ index of current coefficient
    s = 0; \\ sum
    forstep (r=R-1, 0, -1,
        ur = 0; \\ sum of this row
        forstep (c=C-1, 1, -1, k-=1; ur+=vz[c]/fa(k); );
        k -= 1; ur += 1/fa(k);
        if ( r!=R-1, s*=vz[C]; ); \\ R-1 long multiplications
        s += ur;
    );
    s *= z; \\ 1 long multiplication (special for arctan)
    return( s );
}


We compute $\pi/16$ as $\arctan(z)$ where $z = \sqrt{2\sqrt{2}+4} - \sqrt{2} - 1 \approx 0.19891236$ (using relation 32.1-18 on
page 626 twice on $z = 1$), using a precision of 30,000 decimal digits. We use $R = C = \sqrt{N} =: S$:

? z=1; z=z+sqrt(z^2+1); z=z+sqrt(z^2+1); z=1/z; \\ ==> Pi/16
? a=atan(z); \\ built-in arctan: computed in 1,123 ms.
? r=atan_rect(z,S,S); \\ computed in 2,377 ms.
\\ using S=147, and N=S^2=21609
? a-r
0.E-30017 \\ result OK

The given implementation is about half as fast as the built-in routine. Argument reduction makes the 
method much more competitive: 


? a=atan(z); \\ computed in 1,123 ms.
? z=1/z;
? for(k=1,32,z=z+sqrt(z^2+1)) \\ computed in 204 ms.
? z=1/z
4.57161899770987400328861548736 E-11
? S=ceil(sqrt(-1/2*rp*log(10)/log(z)))
39 \\ N=S^2=1521
? r=atan_rect(z,S,S); \\ computed in 284 ms.
? r*=2^32
? a-r
-1.3690050398194919519 E-30016 \\ OK


With 100,000 decimal digits the performance ratio is roughly the same. Note that one has to limit the 
number C of temporaries according to the available memory. 
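The effect of the argument reduction can be tried at machine precision. The following Python sketch (the function name and parameters are illustrative only) applies the halving step arctan(z) = 2 arctan(z/(1+sqrt(1+z^2))) in the same reciprocal form as the session above and then sums a very short series:

```python
import math

def atan_reduced(z, nred, nterms):
    """arctan(z) for z > 0: nred angle halvings, then nterms series terms."""
    w = 1.0 / z                          # work with w = 1/z
    for _ in range(nred):
        w = w + math.sqrt(w * w + 1.0)   # same as z -> z/(1+sqrt(1+z^2))
    z = 1.0 / w
    z2 = z * z
    s = 0.0
    for k in range(nterms - 1, -1, -1):  # Horner scheme in z^2
        coeff = (-1.0) ** k / (2 * k + 1)
        s = coeff + z2 * s
    return z * s * 2.0 ** nred           # undo the halvings
```

After eight halvings of the argument 1 the reduced argument is about 0.003, so four series terms already exhaust double precision. The price is one square root per reduction step, which is why the number of reductions has to be balanced against the number of series terms.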


Compute the inverse sine and cosine as

$$\arcsin(z) = \arctan\!\left(\frac{z}{\sqrt{1-z^2}}\right) \tag{34.2-4a}$$
$$\arccos(z) = \frac{\pi}{2} - \arcsin(z) \tag{34.2-4b}$$


34.2.1.2 Implementation for the logarithm 


A routine for log(1 — z) is 


log_rect(z, R, C)=
{ /* compute log(1-z) as -z*(1+z/2+z^2/3+...+z^(R*C-1)/(R*C)) */
    local(s, vz, ur, k);
    vz = vector(C); \\ vz == [z,z^2,z^3,...,z^C]
    vz[1] = z;
    for (k=2, C, vz[k]=z*vz[k-1]); \\ C-1 long multiplications
    k = R*C; \\ index of current coefficient
    s = 0; \\ sum
    forstep (r=R-1, 0, -1,
        ur = 0; \\ sum of this row
        forstep (c=C-1, 1, -1, k-=1; ur+=vz[c]/(k+1); );
        k -= 1; ur += 1/(k+1);
        if ( r!=R-1, s*=vz[C]; ); \\ R-1 long multiplications
        s += ur;
    );
    s *= z; \\ 1 long multiplication
    return( -s );
}


However, using a precision of 30,000 decimal digits and argument $z = 1/5$ the routine runs at only about
1/7 of the speed of the built-in one (which uses the AGM). With argument reduction (relation 32.1-15 on
page 625) and $R = C = \sqrt{N} =: S$ we get a more competitive performance:

? e=log(1-z) \\ computed in 621 ms.
-0.223143551314209755766295090310
? z=1-z;
? for(k=1,32,z=sqrt(z)); \\ computed in 132 ms.
? z=1-z; \\ == 5.19546566783481003872552738341 E-11
? S=ceil(sqrt(-rp*log(10)/log(z)))
55 \\ N=S^2=3025
? r=log_rect(z,S,S); \\ computed in 461 ms.
? r*=2^32
-0.223143551314209755766295090310
? e-r
-6.071613129762050924 E-30008 \\ OK


We note that with both the logarithm and the arctan, subsequent computations with the built-in routine 
are faster as some constants that are computed with the first call are reused. 


Compute the inverse hyperbolic sine and cosine as 


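$$\operatorname{arcsinh}(z) = \log\!\left(z + \sqrt{z^2+1}\right) \tag{34.2-5a}$$
$$\operatorname{arccosh}(z) = \log\!\left(z + \sqrt{z^2-1}\right) \quad \text{where } z \ge 1 \tag{34.2-5b}$$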


34.2.2 Rectangular scheme for exp, sin, and cos 


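We rewrite the sum of the first $N$ terms of a power series

$$S_N := \sum_{k=0}^{N-1} A_k z^k \tag{34.2-6}$$

as

$$\begin{aligned}
S_N = {} & 1 \left[ A_{0C+0} + A_{1C+0}\, z^{C} + A_{2C+0}\, z^{2C} + \ldots + A_{(R-1)C+0}\, z^{(R-1)C} \right] + \\
& + z \left[ A_{0C+1} + A_{1C+1}\, z^{C} + A_{2C+1}\, z^{2C} + \ldots + A_{(R-1)C+1}\, z^{(R-1)C} \right] + \\
& + z^2 \left[ A_{0C+2} + A_{1C+2}\, z^{C} + A_{2C+2}\, z^{2C} + \ldots + A_{(R-1)C+2}\, z^{(R-1)C} \right] + \\
& + z^3 \left[ A_{0C+3} + A_{1C+3}\, z^{C} + A_{2C+3}\, z^{2C} + \ldots + A_{(R-1)C+3}\, z^{(R-1)C} \right] + \\
& + \ldots + \\
& + z^{C-2} \left[ A_{1C-2} + A_{2C-2}\, z^{C} + A_{3C-2}\, z^{2C} + \ldots + A_{RC-2}\, z^{(R-1)C} \right] + \\
& + z^{C-1} \left[ A_{1C-1} + A_{2C-1}\, z^{C} + A_{3C-1}\, z^{2C} + \ldots + A_{RC-1}\, z^{(R-1)C} \right]
\end{aligned} \tag{34.2-7}$$

Compute the sum as (the transposed version of relation 34.2-2 on page 658)

$$S_N = \Bigl(\ldots\bigl(\bigl(U_{C-1}\, z + U_{C-2}\bigr)\, z + U_{C-3}\bigr)\, z + \ldots + U_1\Bigr)\, z + U_0 \tag{34.2-8}$$

where $U_c := \sum_{r=0}^{R-1} A_{rC+c}\, z^{rC}$ ($C$ temporary sums are computed). If proceeding column-wise, the update
$A_i \to A_{i+1}$ involves only a short multiplication by the ratio $A_{i+1}/A_i$. Only when going to the next
column a long multiplication by $z^C$ is required ($R-1$ long multiplications). Finally, there are $C-1$ long
multiplications by $z$.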
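The transposed scheme can be sketched in Python for the exponential series, where the ratio of successive coefficients is 1/k. Exact rationals stand in for full-precision numbers; the names `exp_rect` and `exp_direct` are chosen for illustration, and unlike the GP routine given below this sketch returns the plain partial sum including the constant term.

```python
from fractions import Fraction

def exp_rect(z, R, C):
    """Partial sum 1 + z + z^2/2! + ... + z^(R*C-1)/(R*C-1)! (exact)."""
    zc = z ** C
    u = [Fraction(0)] * C      # u[c] accumulates the sum U_c
    t = Fraction(1)            # current term without the factor z^c
    k = 0                      # index of the current coefficient
    for r in range(R):
        for c in range(C):
            u[c] += t
            k += 1
            t /= k             # short operation: ratio A_k/A_(k-1) = 1/k
        if r != R - 1:
            t *= zc            # long multiplication, R-1 in total
    s = u[C - 1]
    for c in range(C - 2, -1, -1):
        s = s * z + u[c]       # C-1 long multiplications (Horner in z)
    return s

def exp_direct(z, N):
    """Reference: the same partial sum computed term by term."""
    s, t = Fraction(0), Fraction(1)
    for k in range(N):
        s += t
        t = t * z / (k + 1)
    return s
```

Both functions return the same rational number; the rectangular version needs the long multiplications only for `zc`, the `R-1` column changes, and the final Horner step.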


34.2.2.1 Implementation for the exponential function 


A routine for the computation of exp(z) — 1 can be given as follows: 


exp_rect(z, R, C)=
{ /* compute exp(z)-1 as z*[ 1+z/2!+z^2/3!+...+z^(R*C-1)/((R*C)!) ] */
    local(ur, zc, k, t);
    zc = z^C; \\ proportional log(C) long multiplications
    ur = vector(C);
    k = 1; \\ ratio of series coefficients /* set to zero for plain exp */
    t = 1.0;
    for (r=1, R, \\ number of columns (!)
        for (c=1, C, ur[c] += t; k++; t /= (k); );
        if ( r!=R, t *= zc; ); \\ R-1 long multiplications
    );
    t = ur[C];
    forstep (c=C-1, 1, -1, t*=z; t+=ur[c]); \\ C-1 long multiplications
    t *= z; /* omit for plain exp */
    return( t );
}

We use the argument reduction given as relation 32.2-14 on page 629 and compute exp(1/5) to a precision
of 30,000 decimal digits. We use $R = C = \sqrt{N} =: S$:

? z=0.2;
? e=exp(z) \\ computed in 855 ms.
1.22140275816016983392107199464
? nred=32;
? z/=2^nred
4.65661287307739257812500000000 E-11
? S=48; \\ N=S^2=2304
? r=exp_rect(z,S,S) \\ computed in 395 ms.
4.65661287318581279537523334667 E-11
? for(k=1,nred,r=r+r+r^2); \\ computed in 68 ms.
? r+=1
1.22140275816016983392107199464
? e-r
.965120231677044083 E-30016 \\ OK

The timings for 100,000 digits, nred=112, and S=52 are:

? e=exp(z); \\ computed in 8,601 ms.
? r=exp_rect(z,S,S); \\ computed in 2,345 ms.
? for(k=1,nred,r=r+r+r^2); \\ computed in 1,640 ms.
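The doubling step used in the sessions above follows from exp(2z)-1 = (exp(z)-1)(exp(z)+1) = 2r + r^2 with r = exp(z)-1. A double-precision Python sketch (function name and parameters are illustrative) shows that working with exp(z)-1 instead of exp(z) keeps the tiny reduced value from being swamped by the leading 1:

```python
import math

def expm1_by_doubling(z, nred, nterms):
    """exp(z)-1 via argument reduction by 2^nred and nterms series terms."""
    z = z / 2.0 ** nred
    r = 0.0
    for k in range(nterms, 0, -1):   # Horner form of z + z^2/2! + z^3/3! + ...
        r = (z / k) * (1.0 + r)
    for _ in range(nred):
        r = r + r + r * r            # exp(2z)-1 == 2r + r^2
    return r
```

For z = 0.2 and 32 reduction steps, four series terms are enough to agree with the library routine `math.expm1` to about twelve digits or better.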
34.2.2.2 Implementation for the cosine 


A routine for computing cos(z) — 1 can be given as 


cos_rect(z, R, C)=
{ /* compute cos(z)-1 as z^2*[ -1/2!+z^2/4!-z^4/6! +- ... ] */
    local(ur, zc, k, t);
    z *= z;
    zc = z^C; \\ proportional log(C) long multiplications
    ur = vector(C);
    k = 2; \\ ratio of series coefficients
    t = -0.5;
    for (r=1, R, \\ number of columns (!)
        for (c=1, C, ur[c] += t; k++; t /= (k); k++; t /= -(k); );
        if ( r!=R, t *= zc; ); \\ R-1 long multiplications
    );
    t = ur[C];
    forstep (c=C-1, 1, -1, t*=z; t+=ur[c]); \\ C-1 long multiplications
    t *= z; \\ z was squared above: this is the leading factor z^2
    return( t );
}

We use the argument reduction as in relation 32.2-16 on page 629 and compute cos(1/5) to 30,000 decimal
digits:

? z=0.2;
? e=cos(z) \\ computed in 788 ms.
0.980066577841241631124196516748
? nred=32;
? z/=2^nred;
? S=34; \\ N=S^2=1156
? r=cos_rect(z,S,S); \\ computed in 318 ms.
? for(k=1,nred,r=2*(r+1)^2-2); \\ computed in 70 ms.
? r+=1
0.980066577841241631124196516748
? e-r
.646143951667310362 E-30017 \\ OK




The sine and tangent can be computed as

$$\sin(z) = \sqrt{1 - \cos(z)^2} \tag{34.2-9a}$$
$$\tan(z) = \frac{\sin(z)}{\cos(z)} \tag{34.2-9b}$$

The routine is easily converted to compute the hyperbolic cosine. The following relation (for $z \ge 0$) gives an
alternative way to compute the exponential function:

$$\exp(z) = \cosh(z) + \sqrt{\cosh(z)^2 - 1} \tag{34.2-10}$$


34.3 The magic sumalt algorithm for alternating series 


The following convergence acceleration algorithm for alternating series is due to Cohen, Villegas and 
Zagier, see [111]. As remarked in the cited paper, the algorithm often gives meaningful results also for 
non-alternating and even divergent series. 


The algorithm computes an estimate of the sum $s = \sum_{k=0}^{\infty} x_k$ as

$$s_n = \sum_{k=0}^{n-1} c_{n,k}\, x_k \tag{34.3-1}$$

The weights $c_{n,k}$ do not depend on the values $x_j$. With the following pseudocode the summands $x_k$ have
to be supplied in the array x[0,1,...,n-1]:

function sumalt(x[], n)
{
    d := (3+sqrt(8))^n
    d := (d+1/d)/2
    b := 1
    c := d
    s := 0
    for k:=0 to n-1
    {
        c := c - b
        s := s + c * x[k]
        b := b * (2*(n+k)*(n-k)) / ((2*k+1)*(k+1))
    }
    return s/d
}
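The pseudocode translates directly into exact arithmetic. In the following Python sketch the value d is not computed with a floating-point square root; instead it uses the fact that ((3+sqrt(8))^n + (3-sqrt(8))^n)/2 equals the Chebyshev value T_n(3), which satisfies an integer recurrence. This substitution and the function names are choices made for the sketch.

```python
from fractions import Fraction

def cheb_T_at_3(n):
    """T_n(3) = ((3+sqrt(8))^n + (3-sqrt(8))^n)/2 as an exact integer."""
    t0, t1 = 1, 3                       # T_0(3), T_1(3)
    for _ in range(n):
        t0, t1 = t1, 6 * t1 - t0        # T_(k+1)(3) = 2*3*T_k(3) - T_(k-1)(3)
    return t0

def sumalt(x):
    """Sumalt estimate from the summands x[0..n-1] (signs included in x)."""
    n = len(x)
    d = cheb_T_at_3(n)
    b, c, s = 1, d, Fraction(0)
    for k in range(n):
        c -= b
        s += c * x[k]
        # the weights b are integers, so the division is exact
        b = b * (2 * (n + k) * (n - k)) // ((2 * k + 1) * (k + 1))
    return s / d
```

Calling `4 * sumalt([Fraction((-1)**k, 2*k+1) for k in range(8)])` gives a rational approximation of pi from only eight terms of the slowly converging arctan series.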


With alternating sums the accuracy of the estimate will be $(3+\sqrt{8})^{-n} \approx 5.828^{-n}$. For example, the
estimate for $4 \cdot \arctan(1)$ using the first 8 terms is

$$\pi \approx 4 \cdot \left( \frac{1}{1} - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \frac{1}{13} - \frac{1}{15} \right) = 3.017\ldots \tag{34.3-2}$$

The sumalt-massaged estimate with 8 terms is

$$\pi \approx 4 \cdot \left( \frac{665856}{1} - \frac{665728}{3} + \frac{663040}{5} - \frac{641536}{7} + \frac{557056}{9} - \frac{376832}{11} + \frac{163840}{13} - \frac{32768}{15} \right) / 665857 \tag{34.3-3}$$

$$= 4 \cdot 3365266048/4284789795 = 3.141592665\ldots$$

and already gives seven correct digits of $\pi$. The linear but impressive growth of the accuracy of successive
sumalt estimates with n, the number of terms used, is illustrated in figure 34.3-A.

 n:   sumalt(n)                  pi - sumalt(n)
 1:   2.666666666666666666666    0.474925986923126571795
 2:   3.137254901960784313725    0.004337751629008924737
 3:   3.140740740740740740740    0.000851912849052497721
 4:   3.141635718412148221507   -0.000043064822354983044
 5:   3.141586546403673968348    0.000006107186119270114
 6:   3.141593344215659403660   -0.000000690625866165197
 7:   3.141592564937540122015    0.000000088652253116447
 8:   3.141592665224315864017   -0.000000011634522625555
 9:   3.141592652008811951619    0.000000001580981286843
10:   3.141592653809731569318   -0.000000000219938330856
11:   3.141592653558578755513    0.000000000031214482948
12:   3.141592653594296338470   -0.000000000004503100007
13:   3.141592653589134580517    0.000000000000658657944
14:   3.141592653589890718625   -0.000000000000097480163
15:   3.141592653589778664375    0.000000000000014574087
16:   3.141592653589795436775   -0.000000000000002198312
17:   3.141592653589792904285    0.000000000000000334177
18:   3.141592653589793289614   -0.000000000000000051151
19:   3.141592653589793230584    0.000000000000000007877
20:   3.141592653589793239682   -0.000000000000000001220

Figure 34.3-A: Sumalt-estimates of $\pi = 4 \cdot \arctan(1)$ using n = 1,2,...,20 terms.
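Therefore even slowly converging series like

$$\pi = 4 \cdot \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} = 4 \cdot \arctan(1) \tag{34.3-4a}$$
$$C = \sum_{k=0}^{\infty} \frac{(-1)^k}{(2k+1)^2} = 0.9159655941772190\ldots \tag{34.3-4b}$$
$$\log 2 = \sum_{k=0}^{\infty} \frac{(-1)^k}{k+1} = 0.6931471805599453\ldots \tag{34.3-4c}$$
$$\zeta(s) = \frac{1}{1-2^{1-s}} \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k^s} \tag{34.3-4d}$$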


can be used to compute estimates that are correct up to thousands of digits. The algorithm scales like n? 
if the series terms in the array x[] are small rational values and like n? - log(n) if they are full precision 


(rational or float) values. 


In fact, GP has a built-in sumalt routine, we use it to compute the Catalan constant:

? default(realprecision,1000); sumalt(k=0,(-1)^k/(2*k+1)^2); \\ takes 60 ms.
? default(realprecision,2000); sumalt(k=0,(-1)^k/(2*k+1)^2); \\ takes 376 ms.
? default(realprecision,4000); sumalt(k=0,(-1)^k/(2*k+1)^2); \\ takes 2,730 ms.

The time scales roughly with the third power of the precision used.


The values $c_k$ and $b_k$ occurring in the computation are integers. In fact, the $b_k$ in the computation with
n terms are the coefficients of the expanded n-th Chebyshev polynomial of the first kind with argument
$1+2x$ (see section 35.2 on page 676):

 k:      b_k      c_k
 0:        1   665857
 1:      128   665856
 2:     2688   665728
 3:    21504   663040
 4:    84480   641536
 5:   180224   557056
 6:   212992   376832
 7:   131072   163840
 8:    32768    32768

$$\begin{aligned} T_8(1+2x) = {} & 1 + 128\,x + 2688\,x^2 + 21504\,x^3 + 84480\,x^4 + \\ & + 180224\,x^5 + 212992\,x^6 + 131072\,x^7 + 32768\,x^8 = T_{16}\bigl(\sqrt{1+x}\bigr) \end{aligned} \tag{34.3-5a}$$

$$\begin{aligned} T_{16}(x) = {} & 1 - 128\,x^2 + 2688\,x^4 - 21504\,x^6 + 84480\,x^8 - \\ & - 180224\,x^{10} + 212992\,x^{12} - 131072\,x^{14} + 32768\,x^{16} \end{aligned} \tag{34.3-5b}$$


Now observe that one has always $c_n = b_n = 2^{2n-1}$ in a length-n sumalt computation. The computation
of $(3+\sqrt{8})^n$ can be avoided by the following variant:

function sumalt(x[], n)
{
    b := 2**(2*n-1)
    c := b
    s := 0
    for k:=n-1 to 0 step -1
    {
        s := s + c * x[k]
        b := b * ((2*k+1)*(k+1)) / (2*(n+k)*(n-k))
        c := c + b
    }
    return s/c
}

The $b_k$ and $c_k$ occurring in a length-n sumalt computation can be given explicitly as

$$b_k = \frac{n}{n+k} \binom{n+k}{2k}\, 2^{2k} \tag{34.3-6a}$$
$$c_k = \sum_{i=k}^{n} \frac{n}{n+i} \binom{n+i}{2i}\, 2^{2i} \tag{34.3-6b}$$
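The explicit expressions and the backward variant can be cross-checked with a few lines of Python. The function names are illustrative; all arithmetic is exact, and the integer divisions are exact because the weights are integers.

```python
from fractions import Fraction
from math import comb

def sumalt_weights(n):
    """Lists [b_0..b_n] and [c_0..c_n] from the explicit formulas."""
    b = [n * comb(n + k, 2 * k) * 4 ** k // (n + k) for k in range(n + 1)]
    c = [sum(b[k:]) for k in range(n + 1)]
    return b, c

def sumalt_backward(x):
    """Backward variant: starts from b_n = c_n = 2^(2n-1), no square root."""
    n = len(x)
    b = 2 ** (2 * n - 1)
    c = b
    s = Fraction(0)
    for k in range(n - 1, -1, -1):
        s += c * x[k]
        b = b * ((2 * k + 1) * (k + 1)) // (2 * (n + k) * (n - k))
        c += b
    return s / c
```

For n = 8, `sumalt_weights(8)` reproduces the table of the values b_k and c_k, and at the end of the backward loop the variable `c` equals 665857, the normalizing constant d.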


To compute an estimate of $\sum_{k=0}^{\infty} x_k$ using the first n partial sums use the following pseudocode (the
partial sums $p_k = \sum_{j=0}^{k} x_j$ are expected in p[0,1,...,n-1]). The weight of p[k] is $b_{k+1}$, so b is
updated before it is used and the weights add up to d:

function sumalt_partial(p[], n)
{
    d := (3+sqrt(8))^n
    d := (d+1/d)/2
    b := 1
    s := 0
    for k:=0 to n-1
    {
        b := b * (2*(n+k)*(n-k)) / ((2*k+1)*(k+1))
        s := s + b * p[k]
    }
    return s/d
}

The backward variant is:

function sumalt_partial(p[], n)
{
    b := 2**(2*n-1)
    c := b
    s := 0
    for k:=n-1 to 0 step -1
    {
        s := s + b * p[k]
        b := b * ((2*k+1)*(k+1)) / (2*(n+k)*(n-k))
        c := c + b
    }
    return s/c
}

Implementations of the sumalt algorithm and the variant for partial sums are given in
[hfloat: src/hf/sumalt.cc].
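A Python sketch of the backward variant for partial sums, again with exact rationals (the function name is illustrative and this is not the hfloat code). The partial sum p[k] receives the weight b_(k+1), which is why the loop starts with b_n = 2^(2n-1) multiplying p[n-1]:

```python
from fractions import Fraction
from itertools import accumulate

def sumalt_partial_backward(p):
    """Sumalt estimate from the partial sums p[k] = x[0] + ... + x[k]."""
    n = len(p)
    b = 2 ** (2 * n - 1)         # b_n, the weight of p[n-1]
    c = b
    s = Fraction(0)
    for k in range(n - 1, -1, -1):
        s += b * p[k]
        b = b * ((2 * k + 1) * (k + 1)) // (2 * (n + k) * (n - k))
        c += b                   # after the loop: c == b_0 + ... + b_n
    return s / c
```

Feeding it `list(accumulate(x))` for the eight arctan terms gives exactly the same rational number as the summand-based routine.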


For series of already geometrical rate of convergence (where $|a_k/a_{k+1}| \approx e$) it is better to use

function sumalt_partial(p[], n, e)
{
    d := ( 2*e + 1 + 2*sqrt(e*(e+1)) )^n
    d := (d+1/d)/2
    b := 1
    s := 0
    for k:=0 to n-1
    {
        b := b * (2*(n+k)*(n-k)) / ((2*k+1)*(k+1)) * e
        s := s + b * p[k]
    }
    return s/d
}

Convergence is improved from $\approx e^{-n}$ to $\approx \bigl(2e+1+2\sqrt{e(e+1)}\bigr)^{-n} \approx (4e+2)^{-n}$. The special case
$e = 1$ gives the original sumalt algorithm. For a survey of methods for convergence acceleration see [351].
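The variant with the ratio e can be sketched in Python as follows. Two choices of this sketch should be noted: the normalization d is obtained exactly as the sum of all weights (for the weights used here this sum equals (L^n + L^(-n))/2 with L = 2e+1+2*sqrt(e(e+1))), and the weight is advanced before it multiplies p[k], so that the partial sums receive the weights with indices 1 to n. The function name is illustrative.

```python
from fractions import Fraction
from itertools import accumulate

def sumalt_partial_geom(p, e):
    """Estimate from partial sums p[0..n-1] of a series with ratio about e."""
    n = len(p)
    b = Fraction(1)              # weight with index 0 (enters d only)
    d = b
    s = Fraction(0)
    for k in range(n):
        b = b * (2 * (n + k) * (n - k)) * e / ((2 * k + 1) * (k + 1))
        s += b * p[k]            # weight with index k+1 goes with p[k]
        d += b
    return s / d
```

For the geometric series 1 - 1/2 + 1/4 - ... (sum 2/3, ratio e = 2), ten partial sums give an error of about 1.5e-10, whereas the tenth partial sum itself is still off by about 6.5e-4.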




Chapter 35 


Recurrences and Chebyshev 
polynomials 


We look at several algorithms for recurrences, mostly for the case of constant coefficients. The Chebyshev 
polynomials are described as an important special case of a recurrence. 


35.1 Recurrences 


A sequence $[a_0, a_1, a_2, \ldots]$ so that a recurrence relation

$$a_n = \sum_{j=1}^{k} m_j\, a_{n-j} \tag{35.1-1}$$

with given $m_j$ holds for all $a_n$ is called a k-th order recurrence. The recurrence is linear, homogeneous,
with constant coefficients. The sequence is defined by both the recurrence relation and the first k elements.

For example, the second order recurrence relation $a_n = 1\,a_{n-1} + 1\,a_{n-2}$ together with $a_0 = 0$ and $a_1 = 1$
gives the Fibonacci numbers $F_n$, starting with $a_0 = 2$ and $a_1 = 1$ gives the Lucas numbers $L_n$:


n:0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
F(n): 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 
L(n): 2 1 3 4 7 11 18 29 47 76 123 199 322 521 843 


The characteristic polynomial of the recurrence relation 35.1-1 is given by

$$p(x) = x^k - \sum_{j=1}^{k} m_j\, x^{k-j} \tag{35.1-2}$$

The definition can be motivated by writing down the recurrence relation for the element with index n = k:

$$0 = a_k - \sum_{j=1}^{k} m_j\, a_{k-j} \tag{35.1-3}$$


35.1.1 Fast computation using matrix powers 


To compute the k-th element of the sequence defined by the recurrence relation

$$a_n := m_1\, a_{n-1} + m_2\, a_{n-2} \tag{35.1-4}$$

and the initial values $a_0$, $a_1$, use

$$[a_0,\, a_1] \begin{bmatrix} 0 & m_2 \\ 1 & m_1 \end{bmatrix}^k = [a_k,\, a_{k+1}] \tag{35.1-5}$$

The algorithm is fast when powering algorithms (see section 28.5) are used.
Note that two consecutive terms of the sequence are computed, so the following terms $a_{k+2}, a_{k+3}, \ldots$
can easily be computed by the original recurrence relation.
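A minimal Python sketch of the matrix method with binary powering (all function names are illustrative). The row vector of initial values is multiplied from the left, as in the relation above:

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(M, k):
    """M**k for k >= 0 by binary powering, O(log k) matrix products."""
    n = len(M)
    R = [[int(i == j) for j in range(n)] for i in range(n)]   # identity
    while k > 0:
        if k & 1:
            R = mat_mul(R, M)
        M = mat_mul(M, M)
        k >>= 1
    return R

def rec2(a0, a1, m1, m2, k):
    """Return (a_k, a_(k+1)) for a_n = m1*a_(n-1) + m2*a_(n-2)."""
    P = mat_pow([[0, m2], [1, m1]], k)
    v = mat_mul([[a0, a1]], P)[0]
    return v[0], v[1]
```

For example, `rec2(0, 1, 1, 1, 10)` gives the Fibonacci pair (55, 89) and `rec2(2, 1, 1, 1, 10)` the Lucas pair (123, 199).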


The generalization is straightforward. For example, a recurrence $a_n = m_1\, a_{n-1} + m_2\, a_{n-2} + m_3\, a_{n-3}$
corresponds to

$$[a_0,\, a_1,\, a_2] \begin{bmatrix} 0 & 0 & m_3 \\ 1 & 0 & m_2 \\ 0 & 1 & m_1 \end{bmatrix}^k = [a_k,\, a_{k+1},\, a_{k+2}] \tag{35.1-6}$$

The matrix is the companion matrix of the characteristic polynomial $x^3 - (m_1\, x^2 + m_2\, x^1 + m_3\, x^0)$, see
relation 42.5-1 on page 899. Note that the indexing of the $m_k$ is different here.


Performance 


The computations are fast. As an example we give the timing of the computation of a few sequence terms 
with large indices. The following calculations were carried out with exact arithmetic, the post-multiply 
with the float 1.0 renders the output readable: 


? M=[0,1;1,1] \\ Fibonacci sequence
? #
   timer = 1 (on)
? ([0,1]*M^10000)[1]*1.0
time = 1 ms.
3.364476487643 E2089
? ([0,1]*M^100000)[1]*1.0
time = 10 ms.
2.597406934722 E20898
? ([0,1]*M^1000000)[1]*1.0
time = 458 ms.
1.953282128707 E208987

The powering algorithm can also be used for polynomial recurrences such as for the Chebyshev polynomials
$T_n(x)$:

? M=[0,-1;1,2*x]
[0 -1]
[1 2*x]
? for(n=0,5,print(n,": ",([1,x]*M^n)[1]))
0: 1
1: x
2: 2*x^2 - 1
3: 4*x^3 - 3*x
4: 8*x^4 - 8*x^2 + 1
5: 16*x^5 - 20*x^3 + 5*x
? p=([1,x]*M^1000)[1];
time = 1,027 ms.
? poldegree(p)
1000
? log(polcoeff(p,poldegree(p)))/log(10)
300.728965668317 \\ The coefficient of x^1000 is a 301-digit number

With modular arithmetic the quantities remain bounded and the computations can be carried out for
extremely large values of n. We use the modulus $m = 2^{1279} - 1$ and compute the $n = (m+1)/4$ element
of the sequence 2, 4, 14, 52, ... where $a_n = 4\,a_{n-1} - a_{n-2}$:

? m=2^1279-1; \\ a 1279-bit number
? log(m)/log(10)
385.0173 \\ 386 decimal digits
? M=Mod([0,-1;1,4],m); \\ all entries modulo m
? lift( ([2,4]*M^((m+1)/4))[1] )
time = 118 ms.


The result is zero which proves that m is prime, see section 39.11.4. Here is a one-liner that prints all
exponents e < 1000 of Mersenne primes:

? forprime(e=3,1000,m=2^e-1;M=Mod([0,-1;1,4],m);if(0==(([2,4]*M^((m+1)/4))[1]),print1(" ",e)))
 3 5 7 13 17 19 31 61 89 107 127 521 607

The computation takes a few seconds only.

The connection of recurrences and matrix powers is investigated in [40].
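The modular computation can be reproduced with a small Python sketch of 2x2 matrix powering modulo m. The function names are illustrative, and the test is meant for odd prime exponents e only, as in the one-liner above.

```python
def mul2(A, B, m):
    """Product of two 2x2 matrices with entries reduced modulo m."""
    return [[(A[0][0] * B[0][0] + A[0][1] * B[1][0]) % m,
             (A[0][0] * B[0][1] + A[0][1] * B[1][1]) % m],
            [(A[1][0] * B[0][0] + A[1][1] * B[1][0]) % m,
             (A[1][0] * B[0][1] + A[1][1] * B[1][1]) % m]]

def is_mersenne_prime(e):
    """For an odd prime e: is 2^e - 1 prime?

    Computes the element with index (m+1)/4 of 2, 4, 14, 52, ...
    (a_n = 4*a_(n-1) - a_(n-2)) modulo m = 2^e - 1 and tests for zero.
    """
    m = 2 ** e - 1
    k = (m + 1) // 4
    P = [[1, 0], [0, 1]]
    M = [[0, m - 1], [1, 4 % m]]        # the matrix [0,-1; 1,4] modulo m
    while k:
        if k & 1:
            P = mul2(P, M, m)
        M = mul2(M, M, m)
        k >>= 1
    a_k = (2 * P[0][0] + 4 * P[1][0]) % m   # first entry of [2,4]*M^k
    return a_k == 0
```

The number of matrix products is proportional to e, so exponents in the hundreds are handled quickly even in an interpreted language.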


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35.1.2 Faster computation using polynomial arithmetic 


The matrix power algorithm for computing the k-th element of an n-th order recursion involves $O(\log k)$
multiplications of $n \times n$ matrices. As matrix multiplication (with the straightforward algorithm) is $O(n^3)$
the algorithm is not optimal for recursions of high order. Note that the matrix entries grow exponentially,
so the asymptotics as given is valid only for computations with bounded values such as with modular
arithmetic. We will see that the involved work can be brought down from $\log k \cdot n^3$ to $\log k \cdot n^2$ and even
to $\log k \cdot n \cdot \log n$.

The characteristic polynomial for the recursion $a_n := 3\,a_{n-1} + 1\,a_{n-2} + 2\,a_{n-3}$ is

$$p(x) = x^3 - 3\,x^2 - 1\,x - 2 \tag{35.1-7}$$

We list the first few powers of the companion matrix M of p(x):

$$M^0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad
M^1 = \begin{bmatrix} 0 & 0 & 2 \\ 1 & 0 & 1 \\ 0 & 1 & 3 \end{bmatrix} \quad
M^2 = \begin{bmatrix} 0 & 2 & 6 \\ 0 & 1 & 5 \\ 1 & 3 & 10 \end{bmatrix} \quad
M^3 = \begin{bmatrix} 2 & 6 & 20 \\ 1 & 5 & 16 \\ 3 & 10 & 35 \end{bmatrix} \tag{35.1-8}$$

Each power is a left shifted version of its predecessor, only the rightmost column is 'new'. Now compare
the columns of the matrix powers to the first few values $x^k$ modulo p(x):

$$\begin{aligned}
x^0 \bmod p(x) &= 0\,x^2 + 0\,x + 1 && \text{(35.1-9a)} \\
x^1 \bmod p(x) &= 0\,x^2 + 1\,x + 0 && \text{(35.1-9b)} \\
x^2 \bmod p(x) &= 1\,x^2 + 0\,x + 0 && \text{(35.1-9c)} \\
x^3 \bmod p(x) &= 3\,x^2 + 1\,x + 2 && \text{(35.1-9d)} \\
x^4 \bmod p(x) &= 10\,x^2 + 5\,x + 6 && \text{(35.1-9e)} \\
x^5 \bmod p(x) &= 35\,x^2 + 16\,x + 20 && \text{(35.1-9f)}
\end{aligned}$$

Observe that $x^k \bmod p(x)$ corresponds to the leftmost column of $M^k$.

We now turn the observation into an efficient algorithm. The main routines in this section take as
arguments a vector v of initial values, a vector m of recursion coefficients and an index k. The vector
$r = [a_k, a_{k+1}, \ldots, a_{k+n-1}]$ is returned. We compute the leftmost column of $M^k$ as $z := x^k \bmod p(x)$ and
compute $a_k$ as the scalar product of z (as a vector) and v. Our main routine is:


frec(v, m, k)=
{
    local(n, pc, pv, pp, px, r, t);
    n = length(m);
    if ( k<=n, return( recstep(v, m, k) ) ); \\ small indices by definition
    pc = vec2charpol(m);
    pp = Mod( x, pc );
    px = pp^k;
    r = vector(n);
    for (i=1, n,
        t = lift(px);
        r[i] = sum(j=1,n, v[j]*polcoeff(t,j-1,x));
        px *= pp;
    );
    return( r );
}

If only the value $a_k$ is of interest, skip the computations in the final for loop for the values i > 1.
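The idea of powering x modulo the characteristic polynomial can also be written without a computer algebra system. The following Python sketch (names are illustrative; it assumes order n >= 2 and k >= 0, and uses the straightforward O(n^2) polynomial product) stores polynomials as coefficient lists with the constant term first:

```python
def polmulmod(f, g, m):
    """f*g mod p(x), where p(x) = x^n - (m[0]*x^(n-1) + ... + m[n-1])."""
    n = len(m)
    t = [0] * (2 * n - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            t[i + j] += fi * gj
    for d in range(2 * n - 2, n - 1, -1):    # eliminate x^d, highest first
        c, t[d] = t[d], 0
        for j in range(n):
            t[d - 1 - j] += c * m[j]         # x^n == m[0]*x^(n-1) + ... + m[n-1]
    return t[:n]

def xpowmod(k, m):
    """Coefficients of x^k mod p(x) by binary powering (n = len(m) >= 2)."""
    n = len(m)
    result = [1] + [0] * (n - 1)             # the polynomial 1
    base = [0, 1] + [0] * (n - 2)            # the polynomial x
    while k:
        if k & 1:
            result = polmulmod(result, base, m)
        base = polmulmod(base, base, m)
        k >>= 1
    return result

def frec_single(v, m, k):
    """a_k from initial values v = [a_0..a_(n-1)], coefficients m = [m_1..m_n]."""
    return sum(zj * vj for zj, vj in zip(xpowmod(k, m), v))
```

For the third order example above, `xpowmod(5, [3, 1, 2])` returns `[20, 16, 35]`, the coefficients of 35 x^2 + 16 x + 20.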


For small indices k the result is computed directly by definition, using the following auxiliary routine: 


recstep(v, m, k)=
{ /* update v by k steps according to the recursion coefficients in m */
    local(n,r);
    if ( k<=0, return(v) ); \\ negative k is forbidden
    n = length(m);
    r = vector(n);
    for (i=1, k,
        for (j=1, n-1, r[j]=v[j+1] ); \\ shift left
        r[n] = sum(j=1,n, m[n+1-j]*v[j]); \\ new element (convolution)
        v = r;
    );
    return( r );
}

The auxiliary routine used to compute the characteristic polynomial corresponding to the vector m is:

vec2charpol(m)=
{ /* return characteristic polynomial for the recursion coefficients in m */
    local(d,p);
    d = length(m);
    p = x^d - Pol(m,x);
    return( p );
}


The computation of the k-th element of an n-term recurrence involves $O(\log k)$ modular polynomial
multiplications. Thus the total cost is $O(\log k \cdot M(n))$ where $M(n)$ is the cost of the multiplication of
two polynomials of degree n. That is, the method is $O(\log k \cdot n^2)$ when usual polynomial multiplication
is used and $O(\log k \cdot n \cdot \log n)$ if an FFT scheme is applied. The computational advantage of powering
modulo the characteristic polynomial versus matrix powering is pointed out in [77, p.392] (page 4 of the
preprint).


The matrix power algorithm, restated for the argument structure defined above, can be implemented as: 


mrec(v, m, k)=
{
    local(p,M);
    p = vec2charpol(m);
    M = matcompanion(p);
    M = M^k;
    return( v*M );
}

All main routines can be used with symbolic values:

? frec([a0,a1],[m1,m2],3)
[m2*m1*a0 + (m1^2 + m2)*a1, (m2*m1^2 + m2^2)*a0 + (m1^3 + 2*m2*m1)*a1]
? mrec([a0,a1],[m1,m2],3) \\ same result
? recstep([a0,a1],[m1,m2],3) \\ same result


Performance 


We check the performance of our routines (suppressing output): 


? k=10^5;
? recstep([0,1],[1,1],k);
time = 2,811 ms. \\ time linear in k
? mrec([0,1],[1,1],k);
time = 10 ms. \\ time linear in log(k)
? frec([0,1],[1,1],k);
time = 4 ms. \\ time linear in log(k)

The relative performance of the routines frec() and mrec() differs more with higher orders n of the
recurrence, we use n = 10:

? n=10; v=vector(n); v[n]=1; m=vector(n,j,1); k=10^5; \\ tenth order recurrence
? mrec(v,m,k);
time = 2,813 ms.
? f=frec(v,m,k);
time = 159 ms.
? log(f)/log(10.0)
[30078.67, 30078.97, 30079.27, 30079.58, 30079.88, 30080.18, \
 30080.48, 30080.78, 30081.08, 30081.38] \\ about 30k decimal digits each

We see a performance gain of 2813/159 ≈ 17.7, which is even greater than n. Finally, we repeat the
computations modulo $p = 2^{521} - 1$ for $k = 10^{30}$:

? n=10; v=vector(n); v[n]=1; m=vector(n,j,1); k=10^30;
? p=2^521-1; v=Mod(v,p); m=Mod(m,p);
? mrec(v,m,k);
time = 312 ms.
? frec(v,m,k);
time = 14 ms.

Here the performance gain is 312/14 ≈ 22.3.


35.1.3 Inhomogeneous recurrences 


The fast algorithms for the computation of recurrences only work with homogeneous recurrences as defined
by relation 35.1-1 on page 666. An inhomogeneous recurrence is defined by a relation

$$a_n = \sum_{j=1}^{k} m_j\, a_{n-j} + P(n) \tag{35.1-10}$$

where P(n) is a nonzero polynomial in n. We will show how to transform an inhomogeneous recurrence
into a homogeneous recurrence of greater order.


35.1.3.1 Recurrence relations with a constant 


Let the (k-th order) recurrence relation be

$$a_n = m_1\, a_{n-1} + m_2\, a_{n-2} + \ldots + m_k\, a_{n-k} + c \tag{35.1-11}$$

Now subtract a shifted version

$$a_{n-1} = m_1\, a_{n-2} + m_2\, a_{n-3} + \ldots + m_k\, a_{n-k-1} + c \tag{35.1-12}$$

to obtain a recurrence of order k + 1:

$$a_n = (m_1 + 1)\, a_{n-1} + (m_2 - m_1)\, a_{n-2} + \ldots + (m_k - m_{k-1})\, a_{n-k} - m_k\, a_{n-k-1} \tag{35.1-13}$$

An example should make the idea clear: with $a_n = 34\,a_{n-1} - a_{n-2} + 2$ subtract the shifted version
$a_{n-1} = 34\,a_{n-2} - a_{n-3} + 2$ to get $a_n = 35\,a_{n-1} - 35\,a_{n-2} + a_{n-3}$. For $a_0 = 1$, $a_1 = 36$ we compute the
sequence with the original relation:

? n=7;
? ts=vector(n); ts[1]=1; ts[2]=36;
? for(k=3,n,ts[k]=34*ts[k-1]-ts[k-2]+2);
? ts
[1, 36, 1225, 41616, 1413721, 48024900, 1631432881]

The same sequence can be computed with the relation without constant:

? ts=vector(n); ts[1]=1; ts[2]=36; ts[3]=34*ts[2]-ts[1]+2;
? for(k=4,n,ts[k]=35*ts[k-1]-35*ts[k-2]+ts[k-3]);
? ts
[1, 36, 1225, 41616, 1413721, 48024900, 1631432881]


35.1.3.2 The general case 


If the recurrence is of the form

$$a_n = m_1\, a_{n-1} + m_2\, a_{n-2} + \ldots + m_k\, a_{n-k} + P(n) \tag{35.1-14}$$

where P(n) is a polynomial of degree d in n, then a homogeneous recurrence of order k + d + 1

$$a_n = M_1\, a_{n-1} + M_2\, a_{n-2} + \ldots + M_{k+d+1}\, a_{n-k-d-1} \tag{35.1-15}$$

can be found by repeatedly subtracting a shifted relation.

The following GP routine takes as input a vector of the multipliers $m_i$ (i = 1, ..., k) and a polynomial
of degree d in n. It returns a homogeneous recurrence relation as a vector $[M_1, \ldots, M_{k+d+1}]$:

ihom2hom(m, p)=
{
    local(d, M, k);
    if ( p==0, return(m) );
    d = poldegree(p, 'n);
    k = length(m);
    M = vector(k+d+1);
    for (j=1, k, M[j]=m[j]);
    for (s=1, d+1,
        M[1] += 1; \\ left side
        for (j=2, k+s, M[j] -= m[j-1]; );
        m = M;
    );
    return( M );
}


To verify the output, we use a (slow) routine that directly computes the values of an inhomogeneous 
recurrence: 


ihom(v, m, k, p)=
{
    local(n, r);
    if ( k<=0, return(v[1]) );
    n = length(m);
    r = vector(n);
    for (i=1, k,
        for (j=1, n-1, r[j]=v[j+1] ); \\ shift left
        r[n] = sum(j=1,n, m[n+1-j]*v[j]); \\ new element (convolution)
        r[n] += subst(p, 'n, i+n-1); \\ add inhomogeneous term
        v = r;
    );
    return( r[1] );
}


We use the recurrence relation $a_n = 3\,a_{n-1} + 2\,a_{n-2} + (n^3 - n^2 - 7)$. We compute the homogeneous
equivalent (intermediate values of M added):

? m=[3,+2];p=n^3-n^2-7;
? M=ihom2hom(m,p)
[3, 2, 0, 0, 0, 0]
[4, -1, -2, 0, 0, 0]
[5, -5, -1, 2, 0, 0]
[6, -10, 4, 3, -2, 0]
[7, -16, 14, -1, -5, 2]
[7, -16, 14, -1, -5, 2] \\ a_n = 7*a_{n-1} - 16*a_{n-2} + 14*a_{n-3} +- ...
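The repeated subtraction can be mirrored in Python. In this sketch (illustrative names) only the degree d of the polynomial is passed, because the coefficients of the homogeneous recurrence depend on P(n) through its degree alone; the helper `run` simply iterates a recurrence to allow a comparison.

```python
def ihom2hom(m, d):
    """Multipliers [M_1..M_(k+d+1)] of the homogeneous equivalent.

    m = [m_1..m_k] are the multipliers, d is the degree of P(n).
    """
    k = len(m)
    M = list(m) + [0] * (d + 1)
    for s in range(1, d + 2):
        prev = list(M)              # relation before this subtraction
        M[0] += 1                   # left side of the shifted relation
        for j in range(1, k + s):
            M[j] -= prev[j - 1]     # subtract the shifted right side
    return M

def run(start, m, count, P=lambda n: 0):
    """First `count` elements of a_n = sum_j m[j-1]*a_(n-j) + P(n)."""
    a = list(start)
    while len(a) < count:
        n = len(a)
        a.append(sum(mj * a[n - 1 - j] for j, mj in enumerate(m)) + P(n))
    return a
```

Starting the homogeneous recurrence with the first k+d+1 values of the inhomogeneous one reproduces the same sequence.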


We can compute the first few values for the sequence starting with ag = 2, a1 = 5 by the direct method: 


? v=[2,5];
? for(k=0,9,print(k,": ",ihom(v,m,k,p)));
0: 2
1: 5
2: 16
3: 69
4: 280
5: 1071
6: 3946
7: 14267
8: 51134
9: 182577

A vector of start values and the homogeneous equivalent allow the fast computation using the powering
algorithms:

? V=vector(length(M),j,ihom(v,m,j-1,p))
[2, 5, 16, 69, 280, 1071]
? for(k=0,9,print(k,": ",frec(V,M,k)[1]));
[- same output as with direct computation -]

The computation of $a_{100,000}$ now takes less than a second:

? z=frec(V,M,10^5)[1]; \\ result computed in 156 ms.
? 1.0*z
1.72279531330182 E55164
? z=ihom(v,m,10^5,p); \\ result computed in 6,768 ms.
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35.1.4 Recurrence relations for subsequences 
35.1.4.1 Two-term recurrences 


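The recurrence for the subsequence of every k-th element of a two-term recurrence $a_n = \alpha\, a_{n-1} + \beta\, a_{n-2}$
can be found as follows. Write

$$\begin{aligned}
a_{n+0} &= A_0\, a_n + B_0\, a_{n-0} = 2\, a_n - 1\, a_{n-0} && \text{(35.1-16a)} \\
a_{n+1} &= A_1\, a_n + B_1\, a_{n-1} = \alpha\, a_n + \beta\, a_{n-1} && \text{(35.1-16b)} \\
a_{n+2} &= A_2\, a_n + B_2\, a_{n-2} = (\alpha^2 + 2\beta)\, a_n - \beta^2\, a_{n-2} && \text{(35.1-16c)} \\
a_{n+3} &= A_3\, a_n + B_3\, a_{n-3} = (\alpha^3 + 3\alpha\beta)\, a_n + \beta^3\, a_{n-3} && \text{(35.1-16d)} \\
a_{n+4} &= A_4\, a_n + B_4\, a_{n-4} = (\alpha^4 + 4\alpha^2\beta + 2\beta^2)\, a_n - \beta^4\, a_{n-4} && \text{(35.1-16e)} \\
a_{n+k} &= A_k\, a_n + B_k\, a_{n-k} && \text{(35.1-16f)}
\end{aligned}$$

We have $a_n = A_k\, a_{n-k} + B_k\, a_{n-2k}$ where $A_0 = 2$, $A_1 = \alpha$ and $A_{k+1} = \alpha\, A_k + \beta\, A_{k-1}$ (and $B_k = -(-\beta)^k$).
That is, the first coefficient $A_k$ of the recursion relations for the subsequences can be computed by the
original recurrence relation. For efficient computation use

$$[A_k,\, A_{k+1}] = [2,\, \alpha] \begin{bmatrix} 0 & \beta \\ 1 & \alpha \end{bmatrix}^k \tag{35.1-17}$$

A closed form for $A_k$ in terms of Chebyshev polynomials is given in [39, item 14]:

$$A_k = 2\, (-\beta)^{k/2}\; T_k\!\left(\alpha / \sqrt{-4\beta}\right) \tag{35.1-18}$$

A simple example, let $F_n$ and $L_n$ denote the n-th Fibonacci and Lucas number, respectively. Then
$\alpha = \beta = 1$ and

$$[A_k,\, A_{k+1}] = [2,\, 1] \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}^k = [L_k,\, L_{k+1}] \tag{35.1-19}$$

That is

$$F_{k\,n+e} = L_k\, F_{k\,(n-1)+e} - (-1)^k\, F_{k\,(n-2)+e} \tag{35.1-20}$$

where $k \in \mathbb{Z}$ and $e \in \mathbb{Z}$. The variable e expresses the shift invariance of the relation.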
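A short Python sketch (the function name is illustrative, and only k >= 0 is handled) computes the pair of coefficients for the stride-k subsequence by running the original recurrence on the start values 2 and alpha:

```python
def stride_coeffs(alpha, beta, k):
    """Return (A_k, B_k) with a_n = A_k*a_(n-k) + B_k*a_(n-2k), for k >= 0."""
    A0, A1 = 2, alpha                 # A_0 and A_1
    for _ in range(k):
        A0, A1 = A1, alpha * A1 + beta * A0
    return A0, -((-beta) ** k)
```

For the Fibonacci recurrence and k = 5 this gives (11, 1), that is F_n = 11 F_(n-5) + F_(n-10) with the Lucas number L_5 = 11 as multiplier.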


35.1.4.2 Recurrences of order n 


For the stride-s recurrence relations of order n the following may be the most straightforward algorithm. 
Let p(x) be the characteristic polynomial of the recurrence and M its companion matrix. Then the 
characteristic polynomial of $M^s$ corresponds to the recurrence relation of the stride-s subsequence.


recsubseq(n, s, m=0)=
{ /* Return vector of coefficients of the stride-s subsequence
   * of the n-th order linear recurrence.
   */
    local(p, M, z, r);
    if ( 0==m,
        m = vector(n,j,eval(Str("m" j))); \\ use symbols mj
    , /* else */
        n = length(m); \\ m given
    );
    p = vec2charpol(m);
    M = matcompanion(p);
    z = x^n-charpoly(M^s);
    r = vector(n,j,polcoeff(z,n-j,x));
    return( r );
}
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For the second order recurrence we find what we have already seen for s = 0,... 


? m=[a,b];
? for(s=-2,5,print(s,": ",recsubseq(0,s,m)););
-2: [1/b^2*a^2 + 2/b, -1/b^2]
-1: [-1/b*a, 1/b]
0: [2, -1]
1: [a, b]
2: [a^2 + 2*b, -b^2]
3: [a^3 + 3*b*a, b^3]
4: [a^4 + 4*b*a^2 + 2*b^2, -b^4]
5: [a^5 + 5*b*a^3 + 5*b^2*a, b^5]

For the third order recurrence we find:

? m=[a,b,c];
? for(s=-2,5,print(s,": ",recsubseq(0,s,m)););
-2: [2/-c*a - 1/-c^2*b^2, 1/-c^2*a^2 + 2/-c^2*b, 1/c^2]
-1: [-1/c*b, 1/-c*a, 1/c]
0: [3, -3, 1]
1: [a, b, c]
2: [a^2 + 2*b, 2*c*a - b^2, c^2]
3: [a^3 + 3*b*a + 3*c, -3*c*b*a + (b^3 - 3*c^2), c^3]
4: [a^4 + 4*b*a^2 + 4*c*a + 2*b^2, \
    -2*c^2*a^2 + 4*c*b^2*a + (-b^4 + 4*c^2*b), \
    c^4]
5: [a^5 + 5*b*a^3 + 5*c*a^2 + 5*b^2*a + 5*c*b, \
    5*c^2*b*a^2 + (-5*c*b^3 + 5*c^3)*a + (b^5 - 5*c^2*b^2), \
    c^5]


35.1.5 Generating functions for recurrences 
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A generating function for a recurrence has a power series where the k-th coefficient equals the k-th term 
of the recurrence. For example, for the Fibonacci and Lucas numbers: 


$$\frac{x}{1 - x - x^2} = 0 + x + x^2 + 2x^3 + 3x^4 + 5x^5 + 8x^6 + 13x^7 + \ldots = \sum_{k=0}^{\infty} F_k\, x^k \tag{35.1-21a}$$

$$\frac{2 - x}{1 - x - x^2} = 2 + x + 3x^2 + 4x^3 + 7x^4 + 11x^5 + 18x^6 + 29x^7 + \ldots = \sum_{k=0}^{\infty} L_k\, x^k \tag{35.1-21b}$$

In general, for a recurrence $a_n = \sum_{k=1}^{K} m_k\, a_{n-k}$ with given $a_0, a_1, \ldots, a_{K-1}$ we have

$$\frac{\sum_{j=0}^{K-1} b_j\, x^j}{1 - \sum_{j=1}^{K} m_j\, x^j} = \sum_{j=0}^{\infty} a_j\, x^j \tag{35.1-22a}$$

where the denominator is the reciprocal polynomial of the characteristic polynomial and

$$\begin{aligned}
b_0 &= a_0 && \text{(35.1-23a)} \\
b_1 &= a_1 - (a_0\, m_1) && \text{(35.1-23b)} \\
b_2 &= a_2 - (a_0\, m_2 + a_1\, m_1) && \text{(35.1-23c)} \\
b_3 &= a_3 - (a_0\, m_3 + a_1\, m_2 + a_2\, m_1) && \text{(35.1-23d)} \\
b_k &= a_k - \sum_{j=0}^{k-1} a_j\, m_{k-j} && \text{(35.1-23e)}
\end{aligned}$$

As an example we choose the sequence

$$[0, 0, 1, 1, 2, 4, 7, 13, 24, 44, 81, 149, 274, \ldots] \tag{35.1-24}$$

with the recurrence relation $a_n = a_{n-1} + a_{n-2} + a_{n-3}$:

? a=[0,0,1]; m=[1,1,1]; K=length(m);
? b=vector(K, k, a[k]-sum(j=0,k-2, a[j+1]*m[k-j-1]))
[0, 0, 1]
? pb=sum(j=0,K-1,b[j+1]*x^j)
x^2
? pr=1-sum(k=1,K,m[k]*x^k) \\ reciprocal of charpoly
-x^3 - x^2 - x + 1
? gen=pb/pr \\ the generating function
x^2/(-x^3 - x^2 - x + 1)
? t=taylor(gen, x)
x^2 + x^3 + 2*x^4 + 4*x^5 + 7*x^6 + 13*x^7 + 24*x^8 + 44*x^9 + 81*x^10
 + 149*x^11 + 274*x^12 + 504*x^13 + 927*x^14 + 1705*x^15 + 3136*x^16 + O(x^17)
? t=truncate(t); for(j=0,poldegree(t),print1(" ",polcoeff(t,j)))
 0 0 1 1 2 4 7 13 24 44 81 149 274 504 927 1705 3136
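The same computation can be sketched in Python without symbolic algebra. The numerator coefficients follow the relations for the b_j given above; the series expansion is obtained by comparing coefficients in (1 - sum m_j x^j) * A(x) = B(x). Function names are illustrative.

```python
def gf_numerator(a, m):
    """b_j = a_j - sum_{i<j} a_i * m_(j-i) for j = 0..K-1 (m[0] is m_1)."""
    K = len(m)
    return [a[j] - sum(a[i] * m[j - i - 1] for i in range(j)) for j in range(K)]

def gf_series(b, m, count):
    """First `count` power series coefficients of b(x) / (1 - sum m_j x^j)."""
    out = []
    for n in range(count):
        t = b[n] if n < len(b) else 0
        t += sum(m[j] * out[n - 1 - j] for j in range(len(m)) if n - 1 - j >= 0)
        out.append(t)
    return out
```

For the Lucas numbers, `gf_numerator([2, 1], [1, 1])` gives `[2, -1]`, the numerator 2 - x of the generating function shown above.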


Note that the denominator is the reciprocal of the characteristic polynomial. The general form of the 
expressions for a two-term linear recurrence can be computed using symbols: 


? a=[a0,a1]; m=[m1,m2]; K=length(m);
? b=vector(K,k,a[k]-sum(j=0,k-2,a[j+1]*m[k-j-1]))
[a0, -m1*a0 + a1]
? pb=sum(j=0,K-1,b[j+1]*x^j)
(-m1*a0 + a1)*x + a0
? pr=1-sum(k=1,K,m[k]*x^k)
-m2*x^2 - m1*x + 1
? gen=pb/pr \\ the generating function
((-m1*a0 + a1)*x + a0)/(-m2*x^2 - m1*x + 1)
? t=taylor(gen,x);
? t=truncate(t); for(j=0,poldegree(t),print(j,": ",polcoeff(t,j)))
0: a0
1: a1
2: m2*a0 + m1*a1
3: m2*m1*a0 + (m1^2 + m2)*a1
4: (m2*m1^2 + m2^2)*a0 + (m1^3 + 2*m2*m1)*a1
5: (m2*m1^3 + 2*m2^2*m1)*a0 + (m1^4 + 3*m2*m1^2 + m2^2)*a1
6: (m2*m1^4 + 3*m2^2*m1^2 + m2^3)*a0 + (m1^5 + 4*m2*m1^3 + 3*m2^2*m1)*a1
7: (m2*m1^5 + 4*m2^2*m1^3 + 3*m2^3*m1)*a0 + (m1^6 + 5*m2*m1^4 + 6*m2^2*m1^2 + m2^3)*a1


35.1.6 Binet forms for recurrences 


A closed form expression for the Fibonacci numbers is

$F_n = \frac{1}{\sqrt{5}} \left[ \left( \frac{1+\sqrt{5}}{2} \right)^n - \left( \frac{1-\sqrt{5}}{2} \right)^n \right]$ (35.1-25)

A closed form solution for the two-term recurrence $a_n = m_1\,a_{n-1} + m_2\,a_{n-2}$ is given by

$a_n = \frac{1}{w} \left[ (a_1 - a_0\,r_1)\,r_0^n - (a_1 - a_0\,r_0)\,r_1^n \right]$ (35.1-26a)

where $w = \sqrt{m_1^2 + 4\,m_2}$, $r_0 = (m_1 + w)/2$, and $r_1 = (m_1 - w)/2$ (the relation is valid only if $r_0 \neq r_1$).
For $a_0 = 0$ we have

$a_n = \frac{a_1}{w} \left[ r_0^n - r_1^n \right]$ (35.1-27a)

If $a_0 = 0$, $a_1 = 1$, $\gcd(m_1, m_2) = 1$, and $a_n > 0$ for $n > 0$, then (compare with 39.11-3a on page 797)

$\gcd(a_i, a_j) = a_{\gcd(i,j)}$ (35.1-28)

With integer $m_1 = k > 0$ the $a_n$ are positive for all $n > 0$ if $m_2 > -e_k$ where $[e_1, e_2, \ldots, e_8] = [0,1,2,3,6,7,12,15]$ and $e_k = +2\,e_{k-1} - e_{k-2} + e_{k-4} - 2\,e_{k-5} + e_{k-6}$ for $k \geq 9$. The sequence of values $e_k$ starts as

    k:    1, 2, 3, 4, 5, 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,  21,
    e_k:  0, 1, 2, 3, 6, 7, 12, 15, 20, 23, 30, 35, 42, 47, 56, 63, 72, 79, 90, 99, 110,

With $a_0 = 0$, $a_1 = 1$, $\gcd(m_1, m_2) = 1$, but $m_1$ and $m_2$ otherwise arbitrary, we have

$\gcd(|a_i|, |a_j|) = |a_{\gcd(i,j)}|$ (35.1-29)
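Relation 35.1-29 is easy to test numerically. The following Python sketch is an illustration under stated assumptions (the helper names `two_term` and `gcd_property_holds` are hypothetical): it generates the sequence with $a_0 = 0$, $a_1 = 1$ and compares both sides for all index pairs below a bound.

```python
from math import gcd

def two_term(m1, m2, count, a0=0, a1=1):
    """First `count` terms of a_n = m1*a_{n-1} + m2*a_{n-2}."""
    a = [a0, a1]
    while len(a) < count:
        a.append(m1 * a[-1] + m2 * a[-2])
    return a[:count]

def gcd_property_holds(m1, m2, count=30):
    """Check gcd(|a_i|, |a_j|) == |a_gcd(i,j)| for 1 <= i, j < count."""
    a = two_term(m1, m2, count)
    return all(gcd(abs(a[i]), abs(a[j])) == abs(a[gcd(i, j)])
               for i in range(1, count) for j in range(1, count))

print(gcd_property_holds(1, 1))    # Fibonacci numbers
print(gcd_property_holds(1, -3))   # a sequence with sign changes
```

The check is only evidence for the chosen parameters, not a proof; it assumes $\gcd(m_1, m_2) = 1$ as in the text.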
35.1.6.1 Binet form for n-term recurrences

We use a three-term recurrence to exemplify how expressions like 35.1-26a can be found in general: Let
$a_n = m_1\,a_{n-1} + m_2\,a_{n-2} + m_3\,a_{n-3}$, its characteristic polynomial is $p(x) = x^3 - (m_1\,x^2 + m_2\,x + m_3)$.
Let $r_0, r_1, r_2$ be the roots of $p(x)$. The Binet form of the recurrence is

$a_n = c_0\,r_0^n + c_1\,r_1^n + c_2\,r_2^n$ (35.1-30)

The coefficients $c_0$, $c_1$, and $c_2$ have to satisfy

$a_0 = c_0 + c_1 + c_2$ (35.1-31a)
$a_1 = r_0\,c_0 + r_1\,c_1 + r_2\,c_2$ (35.1-31b)
$a_2 = r_0^2\,c_0 + r_1^2\,c_1 + r_2^2\,c_2$ (35.1-31c)

We have to solve the equation $Z \cdot c = a$ for the vector $c$ where $a$ is the vector of starting values and

$Z = \begin{pmatrix} 1 & 1 & 1 \\ r_0 & r_1 & r_2 \\ r_0^2 & r_1^2 & r_2^2 \end{pmatrix}$ (35.1-32)

Verification with the recurrence $a_n = a_{n-1} + a_{n-2} + a_{n-3}$ starting with $a_0 = a_1 = 0$ and $a_2 = 1$:


    ? a=[0,0,1]~; m=[1,1,1]~; K=length(m);
    ? p=x^K-sum(k=1,K,m[k]*x^(K-k)) \\ characteristic polynomial
    x^3 - x^2 - x - 1
    ? r=(polroots(p))
    [1.8392867, -0.4196433 - 0.6062907*I, -0.4196433 + 0.6062907*I]~
    ? Z=matrix(K,K,ri,ci,r[ci]^(ri-1))
    [1 1 1]
    [1.839286 -0.4196433 - 0.6062907*I -0.4196433 + 0.6062907*I]
    [3.382975 -0.1914878 + 0.5088517*I -0.1914878 - 0.5088517*I]
    ? c=matsolve(Z,a)
    [0.1828035 + 1.8947 E-20*I, -0.09140176 - 0.3405465*I, -0.09140176 + 0.3405465*I]~
    ? norm(Z*c-a) \\ check solution
    [1.147 E-39, 6.795 E-39, 3.673 E-39]~
    ? seq(n)=sum(k=0,K-1,c[k+1]*r[k+1]^n)
    ? for(n=0,20,print1(" ",round(seq(n))))
     0 0 1 1 2 4 7 13 24 44 81 149 274 504 927 1705 3136 5768 10609 19513 35890


The method fails if the characteristic polynomial has multiple roots because then the matrix Z is singular. 
35.1.6.2 Binet form with multiple roots of the characteristic polynomial 


If the characteristic polynomial has multiple roots, the Binet form has coefficients that are polynomials
in $n$. For example, for the characteristic polynomial $p(x) = (x - r_0)^3\,(x - r_1)$ the Binet form would be
$a_n = (c_0 + n\,d_0 + n^2\,e_0)\,r_0^n + c_1\,r_1^n$. With $n = 0$, $1$, and $2$ we obtain the system of equations

$a_0 = (c_0 + 0\,d_0 + 0\,e_0) + c_1$ (35.1-33a)
$a_1 = r_0\,(c_0 + 1\,d_0 + 1^2\,e_0) + r_1\,c_1$ (35.1-33b)
$a_2 = r_0^2\,(c_0 + 2\,d_0 + 2^2\,e_0) + r_1^2\,c_1$ (35.1-33c)

In general, the coefficient of the power of the k-th root $r_k$ in the Binet form must be a polynomial of
degree $m_k - 1$ where $m_k$ is the multiplicity of $r_k$.


35.1.6.3 The special case $c_k = 1$ for all k

Let $p(x)$ be the characteristic polynomial of a recurrence, with roots $r_k$: $p(x) = \prod_k (x - r_k)$. We want to
determine the generating function for the recurrence such that $a_j = \sum_k r_k^j$ (that is, all constants $c_k$ are
one). For the reciprocal polynomial $h$ of $p$ we have $h(x) = \prod_k (1 - r_k\,x)$ and (using the product rule for
differentiation)

$\frac{h'(x)}{h(x)} = \sum_k \frac{-r_k}{1 - r_k\,x}$ (35.1-34)

With $r/(1 - r\,x) = \sum_{j \geq 0} r^{j+1}\,x^j$ we find that

$-\frac{h'(x)}{h(x)} = \sum_{j \geq 0} \left( \sum_k r_k^{j+1} \right) x^j$ (35.1-35)

That is, $a_j = \sum_k r_k^{j+1}$ and $c_k = 1$ for all $k$. The relation is the key to the fast computation of the trace
vector in finite fields, see relation 42.3-6 on page 896.
35.1.7 Logarithms of generating functions ‡

A seemingly mysterious relation for the generating function of the Fibonacci numbers

$f(x) = \frac{1}{1 - x - x^2} = 1 + x + 2x^2 + 3x^3 + 5x^4 + 8x^5 + \ldots = \sum_{k=0}^{\infty} F_{k+1}\,x^k$ (35.1-36a)

is

$\log(f(x)) = x + \frac{1}{2}\,3\,x^2 + \frac{1}{3}\,4\,x^3 + \frac{1}{4}\,7\,x^4 + \frac{1}{5}\,11\,x^5 + \ldots = \sum_{k=1}^{\infty} \frac{1}{k}\,L_k\,x^k$ (35.1-36b)

where $L_k$ are the Lucas numbers. Similarly,

$g(x) := \frac{1}{1 - 2x - x^2} = 1 + 2x + 5x^2 + 12x^3 + 29x^4 + 70x^5 + 169x^6 + \ldots$ (35.1-37a)
$\log(g(x)) = 2 \left[ x + \frac{1}{2}\,3\,x^2 + \frac{1}{3}\,7\,x^3 + \frac{1}{4}\,17\,x^4 + \frac{1}{5}\,41\,x^5 + \frac{1}{6}\,99\,x^6 + \ldots \right]$ (35.1-37b)

Now set $f(x) =: \frac{1}{h(x)}$, then

$\frac{d}{dx} \log(f(x)) = \frac{d}{dx} \log\left( \frac{1}{h(x)} \right) = -\frac{h'(x)}{h(x)}$ (35.1-38)

The expression $-\frac{h'(x)}{h(x)}$ is again the generating function of a recurrence and formal integration of the power
series gives the factors $\frac{1}{k}$. This is a special case of the algorithm for the computation of the logarithm
for power series given in section 32.3 on page 630.
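The appearance of the Lucas numbers can be checked by expanding $-h'(x)/h(x)$ directly. This is a small Python sketch, assuming integer coefficients and a denominator with constant term 1; `series_div` is a hypothetical helper name.

```python
def series_div(num, den, count):
    """Power series coefficients of num(x)/den(x); requires den[0] == 1."""
    out = []
    for n in range(count):
        c = num[n] if n < len(num) else 0
        c -= sum(den[j] * out[n - j] for j in range(1, min(n, len(den) - 1) + 1))
        out.append(c)
    return out

h = [1, -1, -1]                                  # h(x) = 1 - x - x^2, f = 1/h
dh = [k * h[k] for k in range(1, len(h))]        # coefficients of h'(x)
log_deriv = series_div([-c for c in dh], h, 8)   # -h'/h = d/dx log(f)
print(log_deriv)   # the coefficient of x^(k-1) is L_k
```

Formal integration divides the coefficient of $x^{k-1}$ by $k$, which gives the factors $1/k$ in relation 35.1-36b.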
35.2 Chebyshev polynomials 


The Chebyshev polynomials of the first (T) and second (U) kind can be defined by the functions 


$T_n(x) = \cos[n \arccos(x)]$ (35.2-1a)
$U_n(x) = \frac{\sin[(n+1) \arccos(x)]}{\sqrt{1 - x^2}}$ (35.2-1b)

For integral $n$ both of them are polynomials. The first few polynomials are given in figure 35.2-A (first
kind) and figure 35.2-B (second kind).

Expressions as hypergeometric series are given as relations 36.3-7b and 36.3-7c on page 695. Explicit
    T_{-n}(x) = T_n(x)
    T_{-1}(x) = x
    T_0(x)  = 1
    T_1(x)  = x
    T_2(x)  = 2 x^2 - 1
    T_3(x)  = 4 x^3 - 3 x
    T_4(x)  = 8 x^4 - 8 x^2 + 1
    T_5(x)  = 16 x^5 - 20 x^3 + 5 x
    T_6(x)  = 32 x^6 - 48 x^4 + 18 x^2 - 1
    T_7(x)  = 64 x^7 - 112 x^5 + 56 x^3 - 7 x
    T_8(x)  = 128 x^8 - 256 x^6 + 160 x^4 - 32 x^2 + 1
    T_9(x)  = 256 x^9 - 576 x^7 + 432 x^5 - 120 x^3 + 9 x
    T_10(x) = 512 x^10 - 1280 x^8 + 1120 x^6 - 400 x^4 + 50 x^2 - 1
    T_11(x) = 1024 x^11 - 2816 x^9 + 2816 x^7 - 1232 x^5 + 220 x^3 - 11 x
Figure 35.2-A: The first few Chebyshev polynomials of the first kind.

    U_{-n}(x) = -U_{n-2}(x)
    U_{-2}(x) = -1
    U_{-1}(x) = 0
    U_0(x)  = 1
    U_1(x)  = 2 x
    U_2(x)  = 4 x^2 - 1
    U_3(x)  = 8 x^3 - 4 x
    U_4(x)  = 16 x^4 - 12 x^2 + 1
    U_5(x)  = 32 x^5 - 32 x^3 + 6 x
    U_6(x)  = 64 x^6 - 80 x^4 + 24 x^2 - 1
    U_7(x)  = 128 x^7 - 192 x^5 + 80 x^3 - 8 x
    U_8(x)  = 256 x^8 - 448 x^6 + 240 x^4 - 40 x^2 + 1
    U_9(x)  = 512 x^9 - 1024 x^7 + 672 x^5 - 160 x^3 + 10 x
    U_10(x) = 1024 x^10 - 2304 x^8 + 1792 x^6 - 560 x^4 + 60 x^2 - 1
    U_11(x) = 2048 x^11 - 5120 x^9 + 4608 x^7 - 1792 x^5 + 280 x^3 - 12 x
Figure 35.2-B: The first few Chebyshev polynomials of the second kind.
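expressions are

$T_n(x) = \frac{n}{2} \sum_{k=0}^{\lfloor n/2 \rfloor} (-1)^k\,\frac{(n-k-1)!}{k!\,(n-2k)!}\,(2x)^{n-2k}$ (35.2-2a)
$\phantom{T_n(x)} = \frac{n}{2} \sum_{k=0}^{\lfloor n/2 \rfloor} (-1)^k\,\frac{1}{n-k}\,\binom{n-k}{k}\,(2x)^{n-2k}$ (35.2-2b)
$\phantom{T_n(x)} = \sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n}{2k}\,x^{n-2k}\,(x^2 - 1)^k$ (35.2-2c)

and

$U_n(x) = \sum_{k=0}^{\lfloor n/2 \rfloor} (-1)^k\,\frac{(n-k)!}{k!\,(n-2k)!}\,(2x)^{n-2k}$ (35.2-3a)
$\phantom{U_n(x)} = \sum_{k=0}^{\lfloor n/2 \rfloor} (-1)^k\,\binom{n-k}{k}\,(2x)^{n-2k}$ (35.2-3b)
$\phantom{U_n(x)} = \sum_{k=0}^{\lfloor n/2+1 \rfloor} \binom{n+1}{2k+1}\,x^{n-2k}\,(x^2 - 1)^k$ (35.2-3c)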


The indexing of U seems to be slightly unfortunate, having $U_0 = 0$ would render many of the relations
for the Chebyshev polynomials more symmetric.

The $n + 1$ extrema of $T_n(x)$ are located at the points $x_k = \cos\frac{k\,\pi}{n}$ where $k = 0,1,2,\ldots,n$ and
$-1 \leq x_k \leq +1$, which can be seen from the definition. The values at those points are $\pm 1$. The $n$ zeros lie at
$x_k = \cos\frac{(k-1/2)\,\pi}{n}$ where $k = 1,2,3,\ldots,n$.


The expansion of $x^n$ in terms of Chebyshev polynomials of the first kind is, for $n$ even,

$x^n = \frac{1}{2^n}\binom{n}{n/2} + \frac{1}{2^{n-1}} \sum_{k=0}^{n/2-1} \binom{n}{k}\,T_{n-2k}(x)$ (35.2-4a)

and, for odd $n$,

$x^n = \frac{1}{2^{n-1}} \sum_{k=0}^{(n-1)/2} \binom{n}{k}\,T_{n-2k}(x)$ (35.2-4b)

For the Chebyshev polynomials of the first kind we have

$T_n\left( \frac{x + 1/x}{2} \right) = \frac{x^n + 1/x^n}{2}$ (35.2-5)

This relation can be used to find a solution of $T_n(x) = z$ directly. Indeed

$x = \frac{R_n + 1/R_n}{2}$ where $R_n := \left( z + \sqrt{z^2 - 1} \right)^{1/n}$ (35.2-6)

is a solution which can be chosen to be real if $z \in \mathbb{R}$ and $z > 1$. Thus we have the closed form expression

$T_n(z) = \frac{r^n + r^{-n}}{2}$ where $r := \left( z + \sqrt{z^2 - 1} \right)$ (35.2-7)


35.2.1 Recurrence relation, generating functions, and the composition law

Both types of Chebyshev polynomials obey the same recurrence (omitting the argument $x$)

$N_n = 2\,x\,N_{n-1} - N_{n-2}$ (35.2-8)

where $N$ can be either symbol, $T$ or $U$. Recurrence relations for subsequences are:

$N_{n+1} = [2\,x] \cdot N_n - N_{n-1}$ (35.2-9a)
$N_{n+2} = [2\,(2x^2 - 1)] \cdot N_n - N_{n-2}$ (35.2-9b)
$N_{n+3} = [2\,(4x^3 - 3x)] \cdot N_n - N_{n-3}$ (35.2-9c)
$N_{n+4} = [2\,(8x^4 - 8x^2 + 1)] \cdot N_n - N_{n-4}$ (35.2-9d)
$N_{n+5} = [2\,(16x^5 - 20x^3 + 5x)] \cdot N_n - N_{n-5}$ (35.2-9e)
$N_{n+s} = [2\,T_s(x)] \cdot N_n - N_{n-s}$ (35.2-9f)

The identities are equivalent to

$\cos(\varphi + \gamma) = 2 \cos(\gamma) \cos(\varphi) - \cos(\varphi - \gamma)$ (35.2-10a)
$\sin(\varphi + \gamma) = 2 \cos(\gamma) \sin(\varphi) - \sin(\varphi - \gamma)$ (35.2-10b)

The generating functions are

$\frac{1 - x\,t}{1 - 2\,x\,t + t^2} = \sum_{n=0}^{\infty} t^n\,T_n(x)$ (35.2-11a)
$\frac{1}{1 - 2\,x\,t + t^2} = \sum_{n=0}^{\infty} t^n\,U_n(x)$ (35.2-11b)


Quick check of relation 35.2-11a using GP:

    ? gen=truncate(taylor((1-t*x)/(1-2*x*t+t^2),t));
    ? for(k=0,5,print(k,": ",polcoeff(gen,k,t)));
    0: 1
    1: x
    2: 2*x^2 - 1
    3: 4*x^3 - 3*x
    4: 8*x^4 - 8*x^2 + 1
    5: 16*x^5 - 20*x^3 + 5*x


Binet forms for T (compare with relation 35.2-7) and U are

$T_n(x) = \frac{1}{2} \left[ \left( x + \sqrt{x^2 - 1} \right)^n + \left( x - \sqrt{x^2 - 1} \right)^n \right]$ (35.2-12a)
$U_n(x) = \frac{1}{2 \sqrt{x^2 - 1}} \left[ \left( x + \sqrt{x^2 - 1} \right)^{n+1} - \left( x - \sqrt{x^2 - 1} \right)^{n+1} \right]$ (35.2-12b)

We have (compare to relation 35.1-28 on page 674)

$\gcd(U_{n-1}, U_{m-1}) = U_{\gcd(n,m)-1}$ (35.2-13)

Composition is multiplication of indices as can be seen by the definition (relation 35.2-1a):

$T_n(T_m(x)) = T_{n m}(x)$ (35.2-14)

For example,

$T_{2n}(x) = T_2(T_n(x)) = 2\,T_n^2(x) - 1$ (35.2-15a)
$\phantom{T_{2n}(x)} = T_n(T_2(x)) = T_n(2x^2 - 1)$ (35.2-15b)
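The composition law can be verified on small cases with exact integer arithmetic. The following Python sketch is an illustration only; the names `cheb_T` and `poly_eval` are hypothetical and the polynomials are built with recurrence 35.2-8.

```python
def cheb_T(n):
    """Coefficient list (lowest degree first) of T_n, n >= 0, via T_n = 2x T_{n-1} - T_{n-2}."""
    t0, t1 = [1], [0, 1]
    if n == 0:
        return t0
    for _ in range(n - 1):
        nxt = [0] + [2 * c for c in t1]      # 2x * T_{n-1}
        for i, c in enumerate(t0):
            nxt[i] -= c                      # subtract T_{n-2}
        t0, t1 = t1, nxt
    return t1

def poly_eval(p, x):
    """Horner evaluation of a coefficient list."""
    r = 0
    for c in reversed(p):
        r = r * x + c
    return r

# T_3(T_4(2)) and T_12(2) should agree
print(poly_eval(cheb_T(3), poly_eval(cheb_T(4), 2)) == poly_eval(cheb_T(12), 2))
```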


35.2.2 Index-doubling and relations between T and U 


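Index-doubling relations for the polynomials of the first kind are

$T_{2n} = 2\,T_n^2 - 1$ (35.2-16a)
$T_{2n+1} = 2\,T_{n+1}\,T_n - x$ (35.2-16b)
$T_{2n-1} = 2\,T_n\,T_{n-1} - x$ (35.2-16c)

Similar relations for the polynomials of the second kind are

$U_{2n} = U_n^2 - U_{n-1}^2 = (U_n + U_{n-1})\,(U_n - U_{n-1})$ (35.2-17a)
$\phantom{U_{2n}} = U_n\,(U_n - U_{n-2}) - 1 = U_{n-1}\,(U_{n+1} - U_{n-1}) + 1$ (35.2-17b)

$U_{2n+1} = U_n\,(U_{n+1} - U_{n-1})$ (35.2-17c)
$\phantom{U_{2n+1}} = 2\,U_n\,(U_{n+1} - x\,U_n) = 2\,U_n\,(x\,U_n - U_{n-1})$ (35.2-17d)

$U_{2n-1} = U_{n-1}\,(U_n - U_{n-2})$ (35.2-17e)
$\phantom{U_{2n-1}} = 2\,U_{n-1}\,(U_n - x\,U_{n-1}) = 2\,U_{n-1}\,(x\,U_{n-1} - U_{n-2})$ (35.2-17f)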
Some relations between T and U are

$T_n = U_n - x\,U_{n-1} = x\,U_{n-1} - U_{n-2} = \frac{1}{2}\,(U_n - U_{n-2})$ (35.2-18a)
$T_{n+1} = x\,T_n - (1 - x^2)\,U_{n-1}$ (35.2-18b)
$U_{2n} = 2\,T_n\,U_n - 1$ (35.2-18c)
$U_{2n-1} = 2\,T_n\,U_{n-1} = 2\,(T_{n+1}\,U_{n-2} + x)$ (35.2-18d)
$U_{2n+1} = 2\,T_{n+1}\,U_n = 2\,(T_{n+2}\,U_{n-1} + x)$ (35.2-18e)
$U_{2^n - 1} = 2^n \prod_{k=0}^{n-1} T_{2^k}$ (35.2-18f)

Relation 35.2-18b can be written as

$U_n = \frac{x\,T_{n+1} - T_{n+2}}{1 - x^2} = \frac{T_n - x\,T_{n+1}}{1 - x^2}$ (35.2-19)

Sum and difference relations are

$T_{n+m} + T_{n-m} = 2\,T_n\,T_m$ (35.2-20a)
$T_{n+m} - T_{n-m} = 2\,(x^2 - 1)\,U_{n-1}\,U_{m-1}$ (35.2-20b)
$U_{n+m-1} + U_{n-m-1} = 2\,U_{n-1}\,T_m$ (35.2-20c)
$U_{n+m-1} - U_{n-m-1} = 2\,T_n\,U_{m-1}$ (35.2-20d)

Expressions for certain sums:

$\sum_{k=0}^{n} T_{2k} = \frac{1}{2}\,(1 + U_{2n})$ (35.2-21a)
$\sum_{k=0}^{n-1} T_{2k+1} = \frac{1}{2}\,U_{2n-1}$ (35.2-21b)
$\sum_{k=0}^{n} U_{2k} = \frac{1 - T_{2n+2}}{2\,(1 - x^2)}$ (35.2-21c)
$\sum_{k=0}^{n-1} U_{2k+1} = \frac{x - T_{2n+1}}{2\,(1 - x^2)}$ (35.2-21d)

From the relation $\partial_x \cos(n \arccos(x)) = n \sin(n \arccos(x)) / \sqrt{1 - x^2}$ we obtain

$\partial_x T_n(x) = n\,U_{n-1}(x)$ (35.2-22)


35.2.3 Fast computation of the Chebyshev polynomials

We give algorithms that improve on both the matrix power and the polynomial-based algorithms.
35.2.3.1 Chebyshev polynomials of the first kind

For even index use relation 35.2-16a ($T_{2n} = 2\,T_n^2 - 1$). For odd index we use relations 35.2-16c and

We compute the pair $[T_{n-1}, T_n]$ recursively via

$[T_{n-1}, T_n] = [2\,T_{q-1}\,T_q - x,\ 2\,T_q^2 - 1]$ where $q = n/2$, if $n$ even (35.2-23a)
$[T_{n-1}, T_n] = [2\,T_{q-1}^2 - 1,\ 2\,T_{q-1}\,T_q - x]$ where $q = (n+1)/2$, if $n$ odd (35.2-23b)

No multiplication with $x$ occurs, therefore the computation is efficient also for floating-point arguments.
With integer $x$ the cost of the computation of $T_n(x)$ is $O(M(n))$ where $M(n)$ is the cost of a multiplication
of numbers with the precision of the result. If $x$ is a floating-point number, then the cost is
$O(\log_2(n)\,M(n))$ where $M(n)$ is the cost of a multiplication with the precision used.


The code for the pair computations is

    fvT(n, x)=
    { /* return [ T(n-1,x), T(n,x) ] */
        local(nr, t, t1, t2);
        if ( n<=1,
            if ( 1==n, return( [1, x] ) );
            if ( 0==n, return( [x, 1] ) );
            if ( -1==n, return( [2*x^2-1, x] ) );
            return( 0 ); \\ disallow negative index < -1
        );

        nr = (n+1) >> 1; \\ if ( "n even", nr = n/2 , nr = (n+1)/2; );
        vr = fvT(nr, x); \\ recursion
        t1 = vr[1]; t2 = vr[2];
        if ( !bitand(n,1), \\ n is even
            t = [2*t1*t2-x, 2*t2^2-1];
        ,
            t = [2*t1^2-1, 2*t1*t2-x];
        );
        return( t );
    }
The function called by the user is

    fT(n, x)=
    {
        local(q, t, v, T);
        n = abs(n);
        if ( n<=1,
            if (n>=0, return(if(0==n,1,x)));
            return( fT(-n, x) );
        );
        t=0; q=n;
        while ( 0==bitand(q, 1), q>>=1; t+=1; );
        \\ here: n==q*2^t
        T = fvT(q, x)[2];
        while ( t, T=2*T*T-1; t-=1; );
        return( T );
    }

We check the speedup by comparing with the matrix-power computation that gives identical results. We
compute $T_{4,545,967}(2)$, a number with more than 2,600,000 decimal digits:

    vT(n,x)= return( ([1, x]*[0,-1; 1,2*x]^n) );
    x=2; \\ want integer calculations
    n=4545967;
    vT(n,x); \\ computed in 9,800 ms.
    fvT(n,x); \\ computed in 2,241 ms.

C++ implementations for the computation of $T_n(2)$ and $T_n(x)$ modulo $m$ are given in [FXT:
mod/chebyshev1.cc]. Methods similar to the one shown here for the computation of Fibonacci and
Lucas numbers are given in [158, sect.16.7.4-16.7.5, p.106-107].
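For readers without GP at hand, the pair recursion 35.2-23a/b can be sketched in Python as follows. This is a transliteration under assumptions (the name `cheb_T_pair` is hypothetical, only $n \geq 0$ is handled), not the FXT implementation.

```python
def cheb_T_pair(n, x):
    """Return (T_{n-1}(x), T_n(x)) for n >= 0 by halving the index."""
    if n == 0:
        return (x, 1)                 # T_{-1} = x, T_0 = 1
    if n == 1:
        return (1, x)                 # T_0 = 1, T_1 = x
    q = (n + 1) // 2                  # q = n/2 for even n, (n+1)/2 for odd n
    t1, t2 = cheb_T_pair(q, x)        # (T_{q-1}, T_q)
    if n % 2 == 0:
        return (2 * t1 * t2 - x, 2 * t2 * t2 - 1)
    return (2 * t1 * t1 - 1, 2 * t1 * t2 - x)

print(cheb_T_pair(5, 2))   # the pair (T_4(2), T_5(2))
```

As in the GP version, each halving step uses three multiplications and none of them is by $x$.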


35.2.3.2 Chebyshev polynomials of the second kind 


We can use the fast algorithm for the polynomials of the first kind and relation 35.2-19
($U_n = (T_n - x\,T_{n+1})/(1 - x^2)$), this involves a division:

    fU(n, x)=
    {
        local(v);
        if ( 1==x, return(n+1) ); \\ avoid division by zero
        if ( -1==x, return( if ( bitand(n,1), -(n+1), (n+1) ) ) ); \\ avoid division by zero
        v = fvT(n+1, x);
        return( (v[1]-x*v[2])/(1-x^2) );
    }
We give an additional algorithm that involves three multiplications for each reduction of the index $n$.
One multiplication is by the variable $x$. We compute the pair $[U_{n-1}, U_n]$ recursively via

$M_q := (U_q + U_{q-1})\,(U_q - U_{q-1})$ (35.2-24a)

$[U_{n-1}, U_n] = [2\,U_{q-1}\,(U_q - x\,U_{q-1}),\ M_q]$ where $q = n/2$, if $n$ even

$[U_{n-1}, U_n] = [M_q,\ 2\,U_q\,(x\,U_q - U_{q-1})]$ where $q = (n-1)/2$, if $n$ odd

The code for the pair computations is

    fvU(n, x)=
    { /* return [ U(n-1,x), U(n,x) ] */
        local(nr, u1, u0, ue, t, u);
        if ( n<=1,
            if ( 1==n, return( [1, (2*x)] ) );
            if ( 0==n, return( [0, 1] ) );
            if ( -1==n, return( [-1, 0] ) );
            if ( -2==n, return( [-(2*x), -1] ) );
            return( 0 ); \\ disallow negative index < -2
        );

        nr = n >> 1; \\ if ( "n even", nr = n/2 , nr = (n-1)/2; );
        vr = fvU(nr, x); \\ recursion
        u1 = vr[1]; u0 = vr[2];
        ue = (u0+u1) * (u0-u1);
        if ( !bitand(n,1), \\ n is even
            t = u1*(u0-x*u1); t+=t;
            u = [t, ue];
        ,
            t = u0*(x*u0-u1); t+=t;
            u = [ue, t];
        );
        return( u );
    }


The function called by the user is

    fU(n, x)= return( fvU(n,x)[2] );

The comparison with the matrix-power computation shows almost the same speedup as for the polynomials
of the first kind:

    vU(n,x)= return( [0, 1]*[0,-1; 1,2*x]^n );
    x=2; \\ want integer calculations
    n=4545967;
    vU(n,x); \\ computed in 9,783 ms.
    fvU(n,x); \\ computed in 2,704 ms.

C++ implementations for the computation of $U_n(2)$ and $U_n(x)$ modulo $m$ are given in [FXT:
mod/chebyshev2.cc].
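The pair recursion for the polynomials of the second kind translates in the same way. Again this is only a sketch (hypothetical name `cheb_U_pair`, restricted to $n \geq 0$), mirroring the GP routine rather than the C++ code in FXT.

```python
def cheb_U_pair(n, x):
    """Return (U_{n-1}(x), U_n(x)) for n >= 0 by halving the index."""
    if n == 0:
        return (0, 1)                     # U_{-1} = 0, U_0 = 1
    if n == 1:
        return (1, 2 * x)                 # U_0 = 1, U_1 = 2x
    q = n // 2                            # q = n/2 for even n, (n-1)/2 for odd n
    u1, u0 = cheb_U_pair(q, x)            # (U_{q-1}, U_q)
    ue = (u0 + u1) * (u0 - u1)            # M_q = U_{2q}
    if n % 2 == 0:
        return (2 * u1 * (u0 - x * u1), ue)
    return (ue, 2 * u0 * (x * u0 - u1))

print(cheb_U_pair(5, 2))   # the pair (U_4(2), U_5(2))
```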


35.2.3.3 Symbolic computation 


For symbolic computations the explicit power series as in 35.2-2a or 35.2-3a should be
preferred. The following routine computes $T_n$ as a polynomial in $x$:

    chebyTsym(n, x)=
    {
        local(b, s);
        if ( n<0, n=-n); \\ symmetry
        if ( n==1, return( x ) ); \\ avoid division by zero
        b = 2^(n-1);
        if ( 0==n%2, if ( 0==n%4, s=+1, s=-1 ), s=0 );
        forstep (k=n, 1, -2,
            s += b*x^(k);
            b *= -(k*(k-1))/((n+k-2)*(n-k+2));
        );
        return( s );
    }

To compute $U_n$, use

    chebyUsym(n, x)=
    {
        local(b, s);
        if ( n<=0,
            if ( n>=-2, return( n+1 ) );
            return( -chebyUsym( -n-2, x ) ); \\ use symmetry
        );
        n += 1;
        b = 2^(n-1) / n;
        s = 0;
        forstep (k=n, 1, -2,
            s += (k)*b*x^(k-1);
            b *= -(k*(k-1))/((n+k-2)*(n-k+2));
        );
        return( s );
    }


35.2.4 Relations to approximations of the square root ‡

35.2.4.1 Padé approximants for $\sqrt{x^2 \pm 1}$

    k     R_k(2)     R_k(x)
    1:    2/1        (x) / (1)
    2:    7/4        (2x^2 - 1) / (2x)
    3:    26/15      (4x^3 - 3x) / (4x^2 - 1)
    4:    97/56      (8x^4 - 8x^2 + 1) / (8x^3 - 4x)
    5:    362/209    (16x^5 - 20x^3 + 5x) / (16x^4 - 12x^2 + 1)
    oo:   sqrt(3)    sqrt(x^2 - 1)

Figure 35.2-C: The first few values of $R_k(2)$ and $R_k(x)$.

    k     R+_k(1)    R+_k(x)
    1:    1/1        (x) / (1)
    2:    3/2        (2x^2 + 1) / (2x)
    3:    7/5        (4x^3 + 3x) / (4x^2 + 1)
    4:    17/12      (8x^4 + 8x^2 + 1) / (8x^3 + 4x)
    5:    41/29      (16x^5 + 20x^3 + 5x) / (16x^4 + 12x^2 + 1)
    oo:   sqrt(2)    sqrt(x^2 + 1)

Figure 35.2-D: The first few values of $R_k^+(1)$ and $R_k^+(x)$.


We start with the relation (from the definitions 35.2-1a and 35.2-1b on page 676 and $\sin^2 + \cos^2 = 1$)

$T_n^2 - (x^2 - 1)\,U_{n-1}^2 = 1$ (35.2-25)

Now rewrite the equation as

$\sqrt{x^2 - 1} = \sqrt{\frac{T_n^2 - 1}{U_{n-1}^2}}$ (35.2-26)

If we define $R_n = T_n / U_{n-1}$, then

$R_n = \sqrt{(x^2 - 1) + \frac{1}{U_{n-1}^2}} \approx \sqrt{x^2 - 1}$ (35.2-27)

A composition law holds for R:

$R_{m n}(x) = R_m(R_n(x))$ (35.2-28)

The first few values of $R_k(2)$ and $R_k(x)$ are shown in figure 35.2-C.
If we define $T_n^+(x) := T_n(i\,x)/i^n$ and $U_n^+(x) := U_n(i\,x)/i^n$, then

$T_n^{+2} - (x^2 + 1)\,U_{n-1}^{+2} = (-1)^n$ (35.2-29)

Defining $R_n^+ := T_n^+ / U_{n-1}^+$ we have

$R_{m n}^+(x) = R_m^+(R_n^+(x))$ (35.2-30)

and

$\sqrt{x^2 + 1} = \sqrt{\frac{T_n^{+2} - (-1)^n}{U_{n-1}^{+2}}} \approx \frac{T_n^+}{U_{n-1}^+} = R_n^+(x)$ (35.2-31)

The first few values of $R_k^+(1)$ and $R_k^+(x)$ are shown in figure 35.2-D. Relations 35.2-29 and 35.2-25 can
be used to power solutions of Pell's Diophantine equation, see relation 39.13-13a.
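The column $R_k(2)$ of figure 35.2-C and the Pell-type relation 35.2-25 can be reproduced with a few lines of Python. This is an illustrative sketch (the helper name `cheb_TU` is hypothetical) that uses the plain recurrence 35.2-8 rather than the fast pair algorithms.

```python
from fractions import Fraction

def cheb_TU(n, x):
    """Return (T_n(x), U_{n-1}(x)) for n >= 1 via the common recurrence."""
    t_prev, t = 1, x          # T_0, T_1
    u_prev, u = 0, 1          # U_{-1}, U_0
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
        u_prev, u = u, 2 * x * u - u_prev
    return t, u

approximants = [Fraction(*cheb_TU(k, 2)) for k in range(1, 6)]
print(approximants)           # rational approximations of sqrt(3)

# each pair solves the Pell equation T^2 - 3 U^2 = 1 (here x^2 - 1 = 3)
print(all(t * t - 3 * u * u == 1 for t, u in (cheb_TU(k, 2) for k in range(1, 20))))
```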

35.2.4.2 Two products for the square root 

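For those fond of products: for $d > 0$, $d \neq 1$

$\sqrt{d} = \prod_{k=0}^{\infty} \left( 1 + \frac{1}{q_k} \right)$ where $q_0 = \frac{d+1}{d-1}$, $q_{k+1} = 2\,q_k^2 - 1$ (35.2-32)

(convergence is quadratic) and

$\sqrt{d} = \prod_{k=0}^{\infty} \left( 1 + \frac{2}{h_k} \right)$ where $h_0 = \frac{d+3}{d-1}$, $h_{k+1} = (h_k + 2)^2\,(h_k - 1) + 1$ (35.2-33)
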

(convergence is cubic). These are given in and also in [142], more expressions can be found in [133]. 
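The paper gives $h_{k+1} = \frac{4\,d}{d-1} \prod_{i=0}^{k} (h_i^2) - 3$. For $q_k$ in relation 35.2-32 we have

$q_k = T_{2^k}(q_0)$ (35.2-34)
$\frac{1}{q_k} = \frac{(d-1)^N}{\sum_{i=0}^{N} \binom{2N}{2i}\,d^i} = \frac{2\,(d-1)^N}{(1 + \sqrt{d})^{2N} + (1 - \sqrt{d})^{2N}}$ where $N = 2^k$ (35.2-35)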
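We have (for $k \geq 1$, where $T_{2^k}$ is an even function)

$q_k = T_{2^k}(1/c)$ where $c = \frac{1-d}{1+d}$ (35.2-36)

and

$\sqrt{\frac{1-c}{1+c}} \approx \frac{1-c}{c}\ \frac{U_{2^k-1}(1/c)}{T_{2^k}(1/c)}$ (35.2-37)

which can be expressed in $d = \frac{1-c}{1+c}$ as

$\sqrt{d} \approx \frac{2\,d}{1-d}\ \frac{U_{2^k-1}\left(\frac{1+d}{1-d}\right)}{T_{2^k}\left(\frac{1+d}{1-d}\right)}$ where $d > 1$ (35.2-38)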
We have $U_{2^k-1}(x) = 2^k \prod_{i=0}^{k-1} T_{2^i}(x)$. Successively compute $T_{2^i} = 2\,T_{2^{i-1}}^2 - 1$ and accumulate the
product $U_{2^i-1} = 2\,U_{2^{i-1}-1}\,T_{2^{i-1}}$ until $U_{2^k-1}$ and $T_{2^k}$ are obtained. Alternatively use the relation $U_k(x) =
\frac{1}{k+1}\,\partial_x T_{k+1}(x)$ and use the recursion for the coefficients of $T$ as shown in section 35.2.3.3.

A systematic approach to find product expressions for roots is given on page 583.
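The quadratic convergence of product 35.2-32 is visible after very few factors. The following Python sketch uses exact rationals; the function name `sqrt_product` is hypothetical and the code assumes a rational $d > 0$, $d \neq 1$.

```python
from fractions import Fraction

def sqrt_product(d, terms):
    """Partial product of prod (1 + 1/q_k) with q_0 = (d+1)/(d-1), q_{k+1} = 2 q_k^2 - 1."""
    d = Fraction(d)
    q = (d + 1) / (d - 1)
    p = Fraction(1)
    for _ in range(terms):
        p *= 1 + 1 / q
        q = 2 * q * q - 1     # q_{k+1} = T_2(q_k)
    return p

for n in range(1, 6):
    p = sqrt_product(2, n)
    print(n, p, float(p * p - 2))   # the error of p^2 roughly squares in each step
```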




Chapter 36 


Hypergeometric series 


We describe the hypergeometric functions which contain most of the ‘useful’ functions such as the logarithm
and the sine as special cases. The transformation formulas for hypergeometric series often give
series transformations that are non-obvious. The computation of certain hypergeometric functions by
AGM-type algorithms is described in section 31.4 on page 611.


36.1 Definition and basic operations 


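The hypergeometric series $F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right)$ is defined as

$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) := \sum_{k=0}^{\infty} \frac{a^{\overline{k}}\ b^{\overline{k}}}{c^{\overline{k}}}\ \frac{z^k}{k!}$ (36.1-1)

where $z^{\overline{k}} := z\,(z+1)\,(z+2) \cdots (z+k-1)$ is the rising factorial power ($z^{\overline{0}} := 1$). Some sources use the
Pochhammer symbol $(x)_k$ which is the same thing: $(x)_k = x^{\overline{k}}$. We'll stick to the factorial notation.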


The variable z is called the argument, a,b and c are the parameters. Parameters in the upper and lower 
row are called upper and lower parameters, respectively. 


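Note the $k! = 1^{\overline{k}}$ in the denominator of relation 36.1-1. Keep the hidden lower parameter 1 in mind:

$F\!\left(\begin{smallmatrix} 2,\,2 \\ 1 \end{smallmatrix}\,\middle|\,z\right) = \text{``}F\!\left(\begin{smallmatrix} 2,\,2 \\ 1,\,1 \end{smallmatrix}\,\middle|\,z\right)\text{''}$ (36.1-2)

The previous expression is a sum of perfect squares if $z$ is a square.

When a hypergeometric series converges it corresponds to a hypergeometric function. We have

$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) = 1 + \frac{a}{1}\frac{b}{c}\,z \left( 1 + \frac{a+1}{2}\frac{b+1}{c+1}\,z \left( 1 + \frac{a+2}{3}\frac{b+2}{c+2}\,z\,( 1 + \ldots ) \right) \right)$ (36.1-3)

Therefore hypergeometric functions with rational arguments can be computed with the binary splitting
method described in section 34.1.2 on page 652.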


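Hypergeometric series can have any number of parameters:

$F\!\left(\begin{smallmatrix} a_1,\,\ldots,\,a_m \\ b_1,\,\ldots,\,b_n \end{smallmatrix}\,\middle|\,z\right) = \sum_{k=0}^{\infty} \frac{a_1^{\overline{k}} \cdots a_m^{\overline{k}}}{b_1^{\overline{k}} \cdots b_n^{\overline{k}}}\ \frac{z^k}{k!}$ (36.1-4)

These are sometimes called generalized hypergeometric functions. The number of upper and lower parameters
are often emphasized as subscripts left and right to the symbol F. For example, ${}_mF_n$ for the
series in the last relation.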
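The functions $F\!\left(\begin{smallmatrix} a \\ b \end{smallmatrix}\,\middle|\,z\right)$ (of type ${}_1F_1$) are sometimes written as $M(a,b,z)$ or $\Phi(a;b;z)$. Kummer's function
$U(a,b,z)$ (or $\Psi(a;b;z)$) is related to hypergeometric functions of type ${}_2F_0$:

$U(a,b,z) = z^{-a}\,F\!\left(\begin{smallmatrix} a,\ 1+a-b \\ \, \end{smallmatrix}\,\middle|\,-1/z\right)$ (36.1-5)

Note that series ${}_2F_0$ are not convergent. Still, they can be used as asymptotic series for large values of $z$.

The Whittaker functions are related to hypergeometric functions as follows:

$M_{a,b}(z) = e^{-z/2}\,z^{b+1/2}\,F\!\left(\begin{smallmatrix} \frac12 + b - a \\ 1 + 2b \end{smallmatrix}\,\middle|\,z\right)$ (36.1-6a)
$W_{a,b}(z) = e^{-z/2}\,z^{b+1/2}\,U\!\left(\tfrac12 + b - a,\ 1 + 2b,\ z\right)$ (36.1-6b)
$\phantom{W_{a,b}(z)} = e^{-z/2}\,z^{a}\,F\!\left(\begin{smallmatrix} \frac12 + b - a,\ \frac12 - b - a \\ \, \end{smallmatrix}\,\middle|\,-1/z\right)$ (36.1-6c)

Negative integer parameters in the upper row lead to polynomials:

$F\!\left(\begin{smallmatrix} -3,\ 3 \\ 1 \end{smallmatrix}\,\middle|\,z\right) = 1 - 9\,z + 18\,z^2 - 10\,z^3$ (36.1-7)

The lower parameter must not be zero or a negative integer unless there is a negative upper parameter
with smaller absolute value.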
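The definition translates directly into a summation loop that updates each term by the ratio of consecutive terms. The sketch below is an illustration with exact rationals (the name `hyp_partial_sum` is hypothetical; no convergence test is made and a lower parameter that reaches zero raises a division error).

```python
from fractions import Fraction

def hyp_partial_sum(upper, lower, z, terms):
    """Sum the first `terms` terms of F(upper; lower | z) for rational z."""
    total, term = Fraction(0), Fraction(1)
    for k in range(terms):
        total += term
        if term == 0:                        # series terminated (polynomial case)
            break
        ratio = Fraction(z, k + 1)           # the hidden lower parameter 1
        for a in upper:
            ratio *= a + k
        for b in lower:
            ratio /= b + k
        term *= ratio
    return total

# polynomial case of relation 36.1-7 at z = 2: 1 - 18 + 72 - 80
print(hyp_partial_sum([-3, 3], [1], 2, 10))
```

With upper parameters 2, 2 and lower parameter 1 the terms are $(k+1)^2 z^k$, in line with relation 36.1-2.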


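Sometimes one finds the notational convention omitting an argument $z = 1$:

$F\!\left(\begin{smallmatrix} a_1,\,\ldots,\,a_m \\ b_1,\,\ldots,\,b_n \end{smallmatrix}\right) := F\!\left(\begin{smallmatrix} a_1,\,\ldots,\,a_m \\ b_1,\,\ldots,\,b_n \end{smallmatrix}\,\middle|\,1\right)$ (36.1-8)

In the following we never omit the argument.

In-depth treatments of hypergeometric functions are [17] and [356]. Many relations for hypergeometric
functions are given in [1] and [139, vol.1].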


36.1.1 Derivative and differential equation 


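The n-th derivative of a hypergeometric function $f(z) = F\!\left(\begin{smallmatrix} a,\,b,\,\ldots \\ c,\,d,\,\ldots \end{smallmatrix}\,\middle|\,z\right)$ is

$\frac{d^n}{dz^n}\,F\!\left(\begin{smallmatrix} a,\,b,\,\ldots \\ c,\,d,\,\ldots \end{smallmatrix}\,\middle|\,z\right) = \frac{a^{\overline{n}}\ b^{\overline{n}} \cdots}{c^{\overline{n}}\ d^{\overline{n}} \cdots}\ F\!\left(\begin{smallmatrix} a+n,\,b+n,\,\ldots \\ c+n,\,d+n,\,\ldots \end{smallmatrix}\,\middle|\,z\right)$ (36.1-9)

The function $f(z) = F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right)$ is a solution of the differential equation

$z\,(1-z)\,\frac{d^2 f}{dz^2} + [c - (1 + a + b)\,z]\,\frac{d f}{dz} - a\,b\,f = 0$ (36.1-10)

A general form of the differential equation satisfied by $F\!\left(\begin{smallmatrix} a,\,b,\,c,\,\ldots \\ u,\,v,\,w,\,\ldots \end{smallmatrix}\,\middle|\,z\right)$ is

$z\,(\vartheta + a)\,(\vartheta + b)\,(\vartheta + c) \cdots f(z) = \vartheta\,(\vartheta + u - 1)\,(\vartheta + v - 1)\,(\vartheta + w - 1) \cdots f(z)$ (36.1-11)

where $\vartheta$ is the operator $z\,\frac{d}{dz}$. The leftmost $\vartheta$ on the right side of the equation takes care of the hidden
lower parameter 1: $\vartheta = (\vartheta + 1 - 1)$. See for a beautiful derivation. Use relation 11.1-3a on page 278
to rewrite powers of $\vartheta$ as polynomials in $\frac{d}{dz}$.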




36.1.2 Evaluations for fixed argument 1 
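A closed form (in terms of the gamma function) evaluation at $z = 1$ can be given for ${}_2F_1$:

$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,1\right) = \frac{\Gamma(c)\,\Gamma(c-a-b)}{\Gamma(c-a)\,\Gamma(c-b)}$ if $\mathrm{Re}(c-a-b) > 0$ or $b \in \mathbb{Z}$, $b < 0$ (36.1-12)

If $c - a - b < 0$, then we have [356, ex.18, p.299]

$\lim_{z \to 1-} F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) \Big/ \left( \frac{\Gamma(c)\,\Gamma(a+b-c)}{\Gamma(a)\,\Gamma(b)}\,(1-z)^{c-a-b} \right) = 1$ (36.1-13a)

and, for $c - a - b = 0$,

$\lim_{z \to 1-} F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) \Big/ \left( \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\log\frac{1}{1-z} \right) = 1$ (36.1-13b)

For $z = -1$ there is an evaluation due to Kummer:

$F\!\left(\begin{smallmatrix} a,\,b \\ 1+a-b \end{smallmatrix}\,\middle|\,-1\right) = \frac{\Gamma(1+a-b)\,\Gamma(1+a/2)}{\Gamma(1+a)\,\Gamma(1+a/2-b)}$ (36.1-14a)
$\phantom{F(\ldots)} = 2^{-a}\,\sqrt{\pi}\ \frac{\Gamma(1+a-b)}{\Gamma(\frac12 + a/2)\,\Gamma(1 + a/2 - b)}$ (36.1-14b)

Several evaluations at $z = \frac12$ are given in [1], we just give one:

$F\!\left(\begin{smallmatrix} a,\,b \\ \frac12 a + \frac12 b + \frac12 \end{smallmatrix}\,\middle|\,\tfrac12\right) = \sqrt{\pi}\ \frac{\Gamma(\frac12 + \frac12 a + \frac12 b)}{\Gamma(\frac12 + \frac12 a)\,\Gamma(\frac12 + \frac12 b)}$ (36.1-15)

For further information see chap.15], [349, entry “Hypergeometric Function”], and [273]. Various
evaluations of $F\!\left(\begin{smallmatrix} -a\,n,\ b\,n + b_1 \\ c\,n + c_1 \end{smallmatrix}\,\middle|\,z\right)$ for integer $a$, $b$, $c$ can be found in [138].

36.1.3 Extraction of even and odd part

Let $E[f(z)] = (f(z) + f(-z))/2$ (the even powers of the series of $f(z)$) and $O[f(z)] = (f(z) - f(-z))/2$
(the odd powers). We express the even and odd parts of a hypergeometric series as follows: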


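$E\!\left[ F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) \right] = F\!\left(\begin{smallmatrix} \frac{a}{2},\ \frac{a+1}{2},\ \frac{b}{2},\ \frac{b+1}{2} \\ \frac{c}{2},\ \frac{c+1}{2},\ \frac12 \end{smallmatrix}\,\middle|\,z^2\right)$ (36.1-16a)
$O\!\left[ F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) \right] = \frac{a\,b}{c}\,z\ F\!\left(\begin{smallmatrix} \frac{a+1}{2},\ \frac{a+2}{2},\ \frac{b+1}{2},\ \frac{b+2}{2} \\ \frac{c+1}{2},\ \frac{c+2}{2},\ \frac32 \end{smallmatrix}\,\middle|\,z^2\right)$ (36.1-16b)

The lower parameters 1/2 and 3/2 are due to the hidden lower parameter 1. The general case for

$H(z) := F\!\left(\begin{smallmatrix} a_1,\,\ldots,\,a_m \\ b_1,\,\ldots,\,b_n \end{smallmatrix}\,\middle|\,z\right)$ (36.1-17a)

is

$E[H(z)] = F\!\left(\begin{smallmatrix} \frac{a_1}{2},\ \frac{a_1+1}{2},\ \ldots,\ \frac{a_m}{2},\ \frac{a_m+1}{2} \\ \frac{b_1}{2},\ \frac{b_1+1}{2},\ \ldots,\ \frac{b_n}{2},\ \frac{b_n+1}{2},\ \frac12 \end{smallmatrix}\,\middle|\,X\,z^2\right)$ (36.1-17b)
$O[H(z)] = \frac{a_1 \cdots a_m}{b_1 \cdots b_n}\,z\ F\!\left(\begin{smallmatrix} \frac{a_1+1}{2},\ \frac{a_1+2}{2},\ \ldots,\ \frac{a_m+1}{2},\ \frac{a_m+2}{2} \\ \frac{b_1+1}{2},\ \frac{b_1+2}{2},\ \ldots,\ \frac{b_n+1}{2},\ \frac{b_n+2}{2},\ \frac32 \end{smallmatrix}\,\middle|\,X\,z^2\right)$ (36.1-17c)

where $X = 4^{m-n-1}$. For example,

$E\!\left[ F\!\left(\begin{smallmatrix} 1 \\ 2 \end{smallmatrix}\,\middle|\,z\right) \right] = F\!\left(\begin{smallmatrix} \frac12,\ 1 \\ 1,\ \frac32,\ \frac12 \end{smallmatrix}\,\middle|\,\frac{z^2}{4}\right) = F\!\left(\begin{smallmatrix} \, \\ \frac32 \end{smallmatrix}\,\middle|\,\frac{z^2}{4}\right) = \frac{\sinh z}{z}$ (36.1-18)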
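36.1.4 Multisection by selecting terms with exponents s mod M

Let $H(z)$ be any power series. If we write $H_{[s,M]}(z)$ for the series obtained by selecting only the terms
whose exponent of $z$ is congruent to $s$ modulo $M$ (that is, exponents $s$, $s+M$, $s+2M$, and so on), then

$H_{[s,M]}(z) = \frac{1}{M} \sum_{k=0}^{M-1} \omega^{-s\,k}\,H\!\left( \omega^k\,z \right)$ where $\omega := \exp(2\,\pi\,i/M)$ (36.1-19)

For a hypergeometric function $H(z)$ replace every upper and lower parameter $A$ by the $M$ parameters
$(A+s)/M$, $(A+s+1)/M$, $(A+s+2)/M$, ..., $(A+s+M-1)/M$, the argument $z$ by $X\,z^M$ where
$X = (M^M)^{m-n-1}$, and multiply by

$\frac{a_1^{\overline{s}} \cdots a_m^{\overline{s}}}{b_1^{\overline{s}} \cdots b_n^{\overline{s}}}\ \frac{z^s}{s!}$ (36.1-20)

For example, the following two forms give $H_{[0,3]}(z)$, the extraction of all terms of $H(z)$ where the exponent
of $z$ is divisible by 3 ($s = 0$):

$\frac{1}{3} \left[ H(z) + H(\omega\,z) + H(\omega^2\,z) \right] = F\!\left(\begin{smallmatrix} \frac{a_1}{3},\ \frac{a_1+1}{3},\ \frac{a_1+2}{3},\ \ldots,\ \frac{a_m}{3},\ \frac{a_m+1}{3},\ \frac{a_m+2}{3} \\ \frac{b_1}{3},\ \frac{b_1+1}{3},\ \frac{b_1+2}{3},\ \ldots,\ \frac{b_n}{3},\ \frac{b_n+1}{3},\ \frac{b_n+2}{3},\ \frac13,\ \frac23 \end{smallmatrix}\,\middle|\,X\,z^3\right)$ (36.1-21)

where $\omega = \exp(2\,\pi\,i/3)$ and $X = 27^{m-n-1}$. With $H(z) = \exp(z) = F\!\left(\begin{smallmatrix} \, \\ \, \end{smallmatrix}\,\middle|\,z\right)$ we find

$H_{[0,3]}(z) = F\!\left(\begin{smallmatrix} \, \\ \frac13,\ \frac23 \end{smallmatrix}\,\middle|\,\frac{z^3}{27}\right) = \sum_{k=0}^{\infty} \frac{z^{3k}}{(3k)!} = \frac{1}{3} \left[ e^{z} + e^{\omega z} + e^{\Omega z} \right]$ (36.1-22a)

The remaining two functions from the 3-section of $\exp(z)$ are

$H_{[1,3]}(z) = z\,F\!\left(\begin{smallmatrix} \, \\ \frac23,\ \frac43 \end{smallmatrix}\,\middle|\,\frac{z^3}{27}\right) = \sum_{k=0}^{\infty} \frac{z^{3k+1}}{(3k+1)!} = \frac{1}{3} \left[ e^{z} + \Omega\,e^{\omega z} + \omega\,e^{\Omega z} \right]$ (36.1-22b)
$H_{[2,3]}(z) = \frac{z^2}{2}\,F\!\left(\begin{smallmatrix} \, \\ \frac43,\ \frac53 \end{smallmatrix}\,\middle|\,\frac{z^3}{27}\right) = \sum_{k=0}^{\infty} \frac{z^{3k+2}}{(3k+2)!} = \frac{1}{3} \left[ e^{z} + \omega\,e^{\omega z} + \Omega\,e^{\Omega z} \right]$ (36.1-22c)

where $\Omega = \omega^{-1} = \omega^2$. Now write $C_s = H_{[s,3]}(z)$ for $s \in \{0, 1, 2\}$, then we have (omitting arguments)

$\det \begin{pmatrix} C_0 & C_1 & C_2 \\ C_2 & C_0 & C_1 \\ C_1 & C_2 & C_0 \end{pmatrix} = C_0^3 + C_1^3 + C_2^3 - 3\,C_0\,C_1\,C_2 = 1 = e^{z}\,e^{\omega z}\,e^{\Omega z}$ (36.1-23)

which is a three power series analogue of the relation $\cosh^2 - \sinh^2 = 1$, see [336].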
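The 3-section of $\exp(z)$ and relation 36.1-23 can be checked numerically. The sketch below is an illustration in floating point (function names are hypothetical; the truncation at 40 terms is an assumption that is ample for moderate $|z|$).

```python
import cmath
from math import factorial

def exp_section(s, z, terms=40):
    """H_[s,3](z): the terms of exp(z) whose exponent is congruent to s mod 3."""
    return sum(z ** (3 * k + s) / factorial(3 * k + s) for k in range(terms))

def exp_section_filter(s, z):
    """The same quantity via the root-of-unity filter (1/3) sum_k w^(-s k) exp(w^k z)."""
    w = cmath.exp(2j * cmath.pi / 3)
    return sum(w ** (-s * k) * cmath.exp(w ** k * z) for k in range(3)) / 3

c = [exp_section(s, 1.5) for s in range(3)]
print(c[0] ** 3 + c[1] ** 3 + c[2] ** 3 - 3 * c[0] * c[1] * c[2])   # close to 1
```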


36.2 Transformations of hypergeometric series 


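As is obvious from the definition, parameters in the upper row can be swapped (capitalized symbols for
readability):

$F\!\left(\begin{smallmatrix} A,\,B,\,c \\ e,\,f,\,g \end{smallmatrix}\,\middle|\,z\right) = F\!\left(\begin{smallmatrix} B,\,A,\,c \\ e,\,f,\,g \end{smallmatrix}\,\middle|\,z\right)$ (36.2-1)

The same is true for the lower row. Usually one writes the parameters in ascending order. Identical
elements in the lower and upper row can be canceled:

$F\!\left(\begin{smallmatrix} a,\,b,\,C \\ f,\,C \end{smallmatrix}\,\middle|\,z\right) = F\!\left(\begin{smallmatrix} a,\,b \\ f \end{smallmatrix}\,\middle|\,z\right)$ (36.2-2)

These trivial transformations are true for any number of elements. The following transformations are
only valid for the given structure, unless the list of parameters contains an ellipsis ‘...’.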
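36.2.1 Elementary and contiguous relations

By definition we have

$F\!\left(\begin{smallmatrix} a,\,b,\,\ldots \\ c,\,\ldots \end{smallmatrix}\,\middle|\,z\right) = 1 + z\,\frac{a\,b \cdots}{c \cdots}\ F\!\left(\begin{smallmatrix} a+1,\,b+1,\,\ldots,\,1 \\ c+1,\,\ldots,\,2 \end{smallmatrix}\,\middle|\,z\right)$ (36.2-3)

Identities of the following type are called contiguous relations:

$(a - b)\ F\!\left(\begin{smallmatrix} a,\,b,\,\ldots \\ c,\,\ldots \end{smallmatrix}\,\middle|\,z\right) = a\,F\!\left(\begin{smallmatrix} a+1,\,b,\,\ldots \\ c,\,\ldots \end{smallmatrix}\,\middle|\,z\right) - b\,F\!\left(\begin{smallmatrix} a,\,b+1,\,\ldots \\ c,\,\ldots \end{smallmatrix}\,\middle|\,z\right)$ (36.2-4)
$(a - c)\ F\!\left(\begin{smallmatrix} a,\,b,\,\ldots \\ c+1,\,\ldots \end{smallmatrix}\,\middle|\,z\right) = a\,F\!\left(\begin{smallmatrix} a+1,\,b,\,\ldots \\ c+1,\,\ldots \end{smallmatrix}\,\middle|\,z\right) - c\,F\!\left(\begin{smallmatrix} a,\,b,\,\ldots \\ c,\,\ldots \end{smallmatrix}\,\middle|\,z\right)$ (36.2-5)

These are given in [166], the following is taken from [356].

$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) = F\!\left(\begin{smallmatrix} a,\,b+1 \\ c \end{smallmatrix}\,\middle|\,z\right) - \frac{a\,z}{c}\ F\!\left(\begin{smallmatrix} a+1,\,b+1 \\ c+1 \end{smallmatrix}\,\middle|\,z\right)$ (36.2-6)

More relations of this type are given in [1].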
36.2.2 Pfaff’s reflection law and Euler’s identity 
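Pfaff's reflection law can be given as either of:

$\frac{1}{(1-z)^a}\ F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,\frac{-z}{1-z}\right) = F\!\left(\begin{smallmatrix} a,\,c-b \\ c \end{smallmatrix}\,\middle|\,z\right)$ (36.2-7a)
$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) = \frac{1}{(1-z)^a}\ F\!\left(\begin{smallmatrix} a,\,c-b \\ c \end{smallmatrix}\,\middle|\,\frac{-z}{1-z}\right)$ (36.2-7b)
$\phantom{F(\ldots)} = \frac{1}{(1-z)^b}\ F\!\left(\begin{smallmatrix} c-a,\,b \\ c \end{smallmatrix}\,\middle|\,\frac{-z}{1-z}\right)$ (36.2-7c)

Applying the Pfaff reflection on both upper parameters gives Euler's identity:

$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) = (1-z)^{c-a-b}\ F\!\left(\begin{smallmatrix} c-a,\,c-b \\ c \end{smallmatrix}\,\middle|\,z\right)$ (36.2-8)

Now write Euler's transform as

$F\!\left(\begin{smallmatrix} a,\,b \\ a+b+1/r \end{smallmatrix}\,\middle|\,z\right) \Big/ F\!\left(\begin{smallmatrix} a+1/r,\,b+1/r \\ a+b+1/r \end{smallmatrix}\,\middle|\,z\right) = (1-z)^{1/r}$ (36.2-9)

If both hypergeometric series terminate, then the expression on the left is a Padé approximant for the
r-th root, see section 29.2.3.2.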


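Euler's transformation can be generalized for hypergeometric functions ${}_{r+1}F_r$, see [240]. We give two
transforms for hypergeometric functions ${}_3F_2$ where one upper parameter exceeds a lower parameter by
1, taken from [243, p.17]. The first is reminiscent of the Pfaff reflection, for $f = e\,(b - a_2 - 1)/(e - a_2)$,

$F\!\left(\begin{smallmatrix} a_1,\,a_2,\,e+1 \\ b,\,e \end{smallmatrix}\,\middle|\,z\right) = (1-z)^{-a_1}\ F\!\left(\begin{smallmatrix} a_1,\ b-a_2-1,\ f+1 \\ b,\,f \end{smallmatrix}\,\middle|\,\frac{-z}{1-z}\right)$ (36.2-10)

The second is similar to Euler's identity, for $g = [(b - a_1 - 1)\,(b - a_2 - 1)\,e] / [(b - a_1 - a_2 - 1)\,e + a_1\,a_2]$,

$F\!\left(\begin{smallmatrix} a_1,\,a_2,\,e+1 \\ b,\,e \end{smallmatrix}\,\middle|\,z\right) = (1-z)^{b-a_1-a_2-1}\ F\!\left(\begin{smallmatrix} b-a_1-1,\ b-a_2-1,\ g+1 \\ b,\,g \end{smallmatrix}\,\middle|\,z\right)$ (36.2-11)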
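36.2.3 Quadratic transformations and Whipple's identity

The following transformations are due to Gauss:

$F\!\left(\begin{smallmatrix} 2a,\,2b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,z\right) = F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,4\,z\,(1-z)\right)$ where $|z| < \frac12$ (36.2-12a)
$F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,z\right) = F\!\left(\begin{smallmatrix} 2a,\,2b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,\frac{1 - \sqrt{1-z}}{2}\right)$ (36.2-12b)

Rewriting relation 36.2-12a:

$F\!\left(\begin{smallmatrix} 2a,\,2b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,\frac{1-z}{2}\right) = F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,1 - z^2\right)$ (36.2-13)

Whipple's identity connects two hypergeometric functions ${}_3F_2$:

$F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a + \frac12,\ 1-a-b-c \\ 1+a-b,\ 1+a-c \end{smallmatrix}\,\middle|\,\frac{-4\,z}{(1-z)^2}\right) = (1-z)^a\ F\!\left(\begin{smallmatrix} a,\,b,\,c \\ 1+a-b,\ 1+a-c \end{smallmatrix}\,\middle|\,z\right)$ (36.2-14)

Specializing 36.2-14 for $c = (a+1)/2$ (note the symmetry between $b$ and $c$ so specializing for $c = (b+1)/2$
produces the identical relation) gives

$F\!\left(\begin{smallmatrix} a,\,b \\ 1+a-b \end{smallmatrix}\,\middle|\,z\right) = \frac{1}{(1-z)^a}\ F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a + \frac12 - b \\ 1+a-b \end{smallmatrix}\,\middle|\,\frac{-4\,z}{(1-z)^2}\right)$ (36.2-15)
$F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,z\right) = \left( \frac{2}{1 + \sqrt{1-z}} \right)^{2a} F\!\left(\begin{smallmatrix} 2a,\ a-b+\frac12 \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,-\frac{1 - \sqrt{1-z}}{1 + \sqrt{1-z}}\right)$ (36.2-16)

By setting $c := a - b$ in 36.2-15 we find

$F\!\left(\begin{smallmatrix} a,\ a-c \\ 1+c \end{smallmatrix}\,\middle|\,z\right) = \frac{1}{(1-z)^a}\ F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 - \frac12 a + c \\ 1+c \end{smallmatrix}\,\middle|\,\frac{-4\,z}{(1-z)^2}\right)$ (36.2-17)

Similar to the relations by Gauss, from equations 36.2-15 and 36.2-16 we have:

$F\!\left(\begin{smallmatrix} a,\,b \\ 1+a-b \end{smallmatrix}\,\middle|\,-\frac{1-z}{1+z}\right) = \left( \frac{1+z}{2} \right)^{a} F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a + \frac12 - b \\ 1+a-b \end{smallmatrix}\,\middle|\,1 - z^2\right)$ (36.2-18a)
$F\!\left(\begin{smallmatrix} a,\,b \\ 1+a-b \end{smallmatrix}\,\middle|\,-\frac{1 - \sqrt{1-z^2}}{1 + \sqrt{1-z^2}}\right) = \left( \frac{1 + \sqrt{1-z^2}}{2} \right)^{a} F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a + \frac12 - b \\ 1+a-b \end{smallmatrix}\,\middle|\,z^2\right)$ (36.2-18b)

Relation 36.2-18b is found by setting $x = \sqrt{1 - z^2}$ (and replacing $x$ by $z$) in 36.2-18a. The same is true
for the next pair of relations:

$F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,1 - z^2\right) = \left( \frac{2}{1+z} \right)^{2a} F\!\left(\begin{smallmatrix} 2a,\ a-b+\frac12 \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,-\frac{1-z}{1+z}\right)$ (36.2-19a)
$F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,z^2\right) = \left( \frac{2}{1 + \sqrt{1-z^2}} \right)^{2a} F\!\left(\begin{smallmatrix} 2a,\ a-b+\frac12 \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,-\frac{1 - \sqrt{1-z^2}}{1 + \sqrt{1-z^2}}\right)$ (36.2-19b)

The transformations

$F\!\left(\begin{smallmatrix} a,\,b \\ a-b+1 \end{smallmatrix}\,\middle|\,z\right) = (1+z)^{-a}\ F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a + \frac12 \\ a-b+1 \end{smallmatrix}\,\middle|\,\frac{4\,z}{(1+z)^2}\right)$ (36.2-20a)
$\phantom{F(\ldots)} = (1-z)^{-a}\ F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a - b + \frac12 \\ a-b+1 \end{smallmatrix}\,\middle|\,\frac{-4\,z}{(1-z)^2}\right)$ (36.2-20b)
$\phantom{F(\ldots)} = (1 \pm \sqrt{z})^{-2a}\ F\!\left(\begin{smallmatrix} a,\ a-b+\frac12 \\ 2a-2b+1 \end{smallmatrix}\,\middle|\,\frac{\pm 4\,\sqrt{z}}{(1 \pm \sqrt{z})^2}\right)$ (36.2-20c)

are given in [1, rel.15.3.26, p.561]. Specializing for $a = b$ gives

$F\!\left(\begin{smallmatrix} a,\,a \\ 1 \end{smallmatrix}\,\middle|\,z\right) = (1+z)^{-a}\ F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 a + \frac12 \\ 1 \end{smallmatrix}\,\middle|\,\frac{4\,z}{(1+z)^2}\right)$ (36.2-20d)
$\phantom{F(\ldots)} = (1-z)^{-a}\ F\!\left(\begin{smallmatrix} \frac12 a,\ \frac12 - \frac12 a \\ 1 \end{smallmatrix}\,\middle|\,\frac{-4\,z}{(1-z)^2}\right)$ (36.2-20e)
$\phantom{F(\ldots)} = (1 \pm \sqrt{z})^{-2a}\ F\!\left(\begin{smallmatrix} a,\ \frac12 \\ 1 \end{smallmatrix}\,\middle|\,\frac{\pm 4\,\sqrt{z}}{(1 \pm \sqrt{z})^2}\right)$ (36.2-20f)

Relation 36.2-20e is found by setting $c = 0$ in relation 36.2-17. Observe that the hypergeometric function
on the right side of relation 36.2-20e does not change when replacing $a$ by $1 - a$. The next (${}_3F_2$)
transformation is given in [240]:

$(1-z)^{-1}\ F\!\left(\begin{smallmatrix} a,\,b,\,1 \\ \frac12 + \frac12 a + \frac12 b,\ 2 \end{smallmatrix}\,\middle|\,z\right) = F\!\left(\begin{smallmatrix} \frac12 + \frac12 a,\ \frac12 + \frac12 b,\ 1 \\ \frac12 + \frac12 a + \frac12 b,\ 2 \end{smallmatrix}\,\middle|\,4\,z\,(1-z)\right)$ (36.2-21)

The following are special cases of this transformation: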


1— l,lg,1—1 
a-a?r(* ; a 2) = rà? o 20 412) (36.2-22a) 
1 licia 1+ la, 1 
1-z)lrF(^ = p[5'29 'sx^^'4u- 36.2-22b 
atra) = r(*t 55 aaa) 60222) 
1 igi 34 
atra :) = gee e 1201-9) (36.2-22c) 
299 273% 


More quadratic (and cubic) transformations are given in [139, vol.1, pp.110-114] and [222], see also [163].
The nonlinear transformation (given in [273, p.21])

$F\!\left(\begin{smallmatrix} a,\,b \\ c \end{smallmatrix}\,\middle|\,z\right) = (1-w)^{2a} \sum_{n=0}^{\infty} d_n\,w^n$ (36.2-23a)

where

$w = \frac{\sqrt{1-z} - 1}{\sqrt{1-z} + 1}$, $\quad z = \frac{-4\,w}{(1-w)^2}$ (36.2-23b)

and

$d_0 = 1$ (36.2-23c)
$d_1 = \frac{2\,a\,(c - 2b)}{c}$ (36.2-23d)
$d_{n+2} = \frac{2\,(c - 2b)\,(n+1+a)\,d_{n+1} + (n + 2a)\,(n + 2a + 1 - c)\,d_n}{(n+2)\,(n+1+c)}$ (36.2-23e)

maps the complex z-plane onto the unit disc. Therefore the w-form of the series converges for all $z \neq 1$.


36.2.4 Clausen's product formula 


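Clausen's product formula [108] connects hypergeometric functions of type ${}_2F_1$ and ${}_3F_2$:

$\left[ F\!\left(\begin{smallmatrix} a,\,b \\ a+b+\frac12 \end{smallmatrix}\,\middle|\,z\right) \right]^2 = F\!\left(\begin{smallmatrix} 2a,\ a+b,\ 2b \\ a+b+\frac12,\ 2a+2b \end{smallmatrix}\,\middle|\,z\right)$ (36.2-24)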
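Clausen's formula is an identity of power series, so it can be checked coefficient by coefficient with exact rationals. The sketch below is an illustration for one arbitrarily chosen pair of parameters (the helper names `hyp_coeffs` and `series_square` are hypothetical).

```python
from fractions import Fraction

def hyp_coeffs(upper, lower, count):
    """First `count` power series coefficients of F(upper; lower | z)."""
    out, c = [], Fraction(1)
    for k in range(count):
        out.append(c)
        c /= k + 1                   # hidden lower parameter 1
        for a in upper:
            c *= a + k
        for b in lower:
            c /= b + k
    return out

def series_square(c):
    """Coefficients of the square of a truncated power series."""
    return [sum(c[i] * c[k - i] for i in range(k + 1)) for k in range(len(c))]

a, b = Fraction(1, 3), Fraction(2, 5)            # arbitrary test parameters
half = Fraction(1, 2)
lhs = series_square(hyp_coeffs([a, b], [a + b + half], 8))
rhs = hyp_coeffs([2 * a, a + b, 2 * b], [a + b + half, 2 * a + 2 * b], 8)
print(lhs == rhs)
```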
A relation due to Goursat [164, p.416] is:
1, b-1| M? 2a+1,a+b+1, 2b 4-1 
ra, k 3] S ares ,a+b4+1, 2b+ 
a+b+3 


a+b+3,2a+2b+2 
Two transforms from p.266], the second form is found by setting (a, b) > (a — 1/4, b — 1/4): 


b SEN l bp b ob b, 4 +b 
F( de? 22, paso E (36.2-26a) 
ato+s 2—g—b 


a tatb,2—a—b 


E udi ME YE pod AE TIE 
que ND 04 VD = 26. uM iu D (36.2-26b) 
a — à — 


:) (36.2-25) 


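The following relations are given in [17, p.184]:

\[ F\left({a,\,b \atop a+b-\frac{1}{2}}\,\middle|\,z\right) F\left({a,\,b \atop a+b+\frac{1}{2}}\,\middle|\,z\right) = F\left({2a,\,2b,\,a+b \atop 2a+2b-1,\,a+b+\frac{1}{2}}\,\middle|\,z\right) \tag{36.2-27a} \]
\[ F\left({a,\,b \atop a+b-\frac{1}{2}}\,\middle|\,z\right) F\left({a,\,b-1 \atop a+b-\frac{1}{2}}\,\middle|\,z\right) = F\left({2a,\,2b-1,\,a+b-1 \atop 2a+2b-2,\,a+b-\frac{1}{2}}\,\middle|\,z\right) \tag{36.2-27b} \]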


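We note that relation 36.2-24 is the special case c = a + b of the identity [356, ex.16, p.298]:

\[ F\left({a,\,b \atop c+\frac{1}{2}}\,\middle|\,z\right) F\left({c-a,\,c-b \atop c+\frac{1}{2}}\,\middle|\,z\right) = \sum_{k=0}^{\infty} \frac{c^{\overline{k}}}{\left(c+\frac{1}{2}\right)^{\overline{k}}}\, A_k\, z^k \tag{36.2-28a} \]

where the $A_k$ are defined by

\[ (1-z)^{a+b-c}\, F\left({2a,\,2b \atop 2c}\,\middle|\,z\right) = \sum_{k=0}^{\infty} A_k\, z^k \tag{36.2-28b} \]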


For more relations of this type see p.84-87]. If a = b + 1/2 in relation 36.2-24 then (two parameters on the right side cancel)

\[ \left[ F\left({b+\frac{1}{2},\,b \atop 2b+1}\,\middle|\,z\right) \right]^2 = F\left({2b+\frac{1}{2},\,2b \atop 4b+1}\,\middle|\,z\right) \tag{36.2-29} \]

and the right side again matches the structure on the left. The corresponding function can be identified (see [68, p.190]) as $G_b(z) := F\left({b+\frac{1}{2},\,b \atop 2b+1}\,\middle|\,z\right) = \left(\frac{1+\sqrt{1-z}}{2}\right)^{-2b}$. We have $G_{n m}(z) = [G_n(z)]^m$.


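Specializing relation 36.2-26b for b = −a we find

\[ \left[ F\left({\frac{1}{4}+a,\,\frac{1}{4}-a \atop 1}\,\middle|\,z\right) \right]^2 = F\left({\frac{1}{2}+2a,\,\frac{1}{2},\,\frac{1}{2}-2a \atop 1,\,1}\,\middle|\,z\right) \tag{36.2-30} \]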


For a = 0, z = 1 (or z a sixth power) this relation is an identity between the square of a sum of squares 
and a sum of cubes: 


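\[ \left[ \sum_{n=0}^{\infty} \left[ \prod_{j=1}^{n} \frac{4j-3}{4j} \right]^2 \right]^2 = \sum_{n=0}^{\infty} \left[ \prod_{j=1}^{n} \frac{2j-1}{2j} \right]^3 = 1.39320392968\ldots \tag{36.2-31} \]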


The relation can be found by setting α = β = 1/4 and γ = 1/2 in exercise 16 in [356, p.298]. Setting a = 1/2 in exercise 28 in [356, p.301] we find that the quantity equals $\pi/\Gamma(3/4)^4 = \Gamma(1/4)^4/(4\pi^3)$. For the square root of the expressions ($\sqrt{1.39320\ldots} = 1.180340\ldots$) we have [140, p.34]:

\[ F\left({\frac{1}{4},\,\frac{1}{4} \atop 1}\,\middle|\,1\right) = \left[ \sum_{n=-\infty}^{\infty} e^{-\pi n^2} \right]^2 = 1.180340599016\ldots \tag{36.2-32} \]
36.2.5 The Kummer transformation

The Kummer transformation connects two hypergeometric functions of type 1F1:

\[ \exp(z)\, F\left({a \atop b}\,\middle|\,-z\right) = F\left({b-a \atop b}\,\middle|\,z\right) \tag{36.2-33} \]

The relation is not valid if both a and b are negative integers. In that case one obtains the Padé approximants of exp(z), see relation 32.2-13 on page 629.

We give more transformations where the number of the lower parameters is greater than or equal to the number of upper parameters. A transformation from 1F1 to 2F3 is given by

\[ F\left({a \atop b}\,\middle|\,z\right) F\left({a \atop b}\,\middle|\,-z\right) = F\left({a,\,b-a \atop b,\,\frac{1}{2}b,\,\frac{1}{2}(b+1)}\,\middle|\,\frac{z^2}{4}\right) \tag{36.2-34} \]
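The Kummer transformation, exp(z) F(a; b | −z) = F(b−a; b | z), is easy to check numerically. The following is a minimal sketch, assuming plain double precision and moderate |z|; the names `hyp1f1` and `kummer_defect` are ours and not part of the accompanying software.

```cpp
#include <cassert>
#include <cmath>

// Partial sum of 1F1(a; b | z) = sum_k (a)_k / (b)_k * z^k / k!
// (b must not be zero or a negative integer).
double hyp1f1(double a, double b, double z, int terms = 200)
{
    assert(terms > 0);
    double term = 1.0, sum = 1.0;
    for (int k = 0; k < terms; ++k)
    {
        term *= (a + k) / (b + k) * z / (k + 1);  // ratio of consecutive terms
        sum += term;
        if (std::fabs(term) < 1e-17 * std::fabs(sum)) break;
    }
    return sum;
}

// Absolute difference between the two sides of the Kummer transformation.
double kummer_defect(double a, double b, double z)
{
    return std::fabs(std::exp(z) * hyp1f1(a, b, -z) - hyp1f1(b - a, b, z));
}
```

For example, `kummer_defect(0.3, 1.7, 0.9)` should be of the order of the rounding error.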
Setting b = 2a and using relation 36.2-33 gives

\[ \left[ F\left({a \atop 2a}\,\middle|\,z\right) \right]^2 = \exp(z)\, F\left({a \atop a+\frac{1}{2},\,2a}\,\middle|\,\frac{z^2}{4}\right) \tag{36.2-35} \]

This relation (attributed to Preece) and also the following transformation are given in [355].
1 1 1 2 
5 +a $74 3 z 
A E = F 2 7 3: 
Ge :) Ca :) exp(z) NN. 3 (36.2-36) 
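The relation

\[ F\left({a \atop 2a}\,\middle|\,z\right) F\left({b \atop 2b}\,\middle|\,-z\right) = F\left({\frac{1}{2}(a+b),\,\frac{1}{2}(a+b+1) \atop a+\frac{1}{2},\,b+\frac{1}{2},\,a+b}\,\middle|\,\frac{z^2}{4}\right) \tag{36.2-37} \]

is given in [33, rel.2.11, p.246]. Setting b = a gives

\[ F\left({a \atop 2a}\,\middle|\,z\right) F\left({a \atop 2a}\,\middle|\,-z\right) = F\left({a \atop a+\frac{1}{2},\,2a}\,\middle|\,\frac{z^2}{4}\right) \tag{36.2-38} \]

A generalization is [204]

\[ F\left({a \atop ka}\,\middle|\,z\right) F\left({a \atop ka}\,\middle|\,-z\right) = F\left({a,\,(k-1)a \atop ka,\,\frac{1}{2}ka,\,\frac{1}{2}ka+\frac{1}{2}}\,\middle|\,\frac{z^2}{4}\right) \tag{36.2-39} \]

The following transformation [33, rel.2.03, p.245] connects functions 0F1 and 2F3:

\[ F\left({\ \atop a}\,\middle|\,z\right) F\left({\ \atop b}\,\middle|\,z\right) = F\left({\frac{1}{2}(a+b),\,\frac{1}{2}(a+b-1) \atop a,\,b,\,a+b-1}\,\middle|\,4z\right) \tag{36.2-40} \]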


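The relation is given in [139, vol.1, p.186] and also

\[ F\left({\ \atop a}\,\middle|\,z\right) F\left({\ \atop a}\,\middle|\,-z\right) = F\left({\ \atop a,\,\frac{1}{2}a,\,\frac{1}{2}(a+1)}\,\middle|\,-\frac{z^2}{4}\right) \tag{36.2-41} \]

Splitting into even and odd parts gives

\[ F\left({\ \atop a}\,\middle|\,z\right) = F\left({\ \atop \frac{a}{2},\,\frac{a+1}{2},\,\frac{1}{2}}\,\middle|\,\frac{z^2}{16}\right) + \frac{z}{a}\, F\left({\ \atop \frac{a+1}{2},\,\frac{a+2}{2},\,\frac{3}{2}}\,\middle|\,\frac{z^2}{16}\right) \tag{36.2-42} \]

Setting b = a in relation 36.2-40 gives (cancellation of parameters on the right side)

\[ \left[ F\left({\ \atop a}\,\middle|\,z\right) \right]^2 = F\left({a-\frac{1}{2} \atop a,\,2a-1}\,\middle|\,4z\right) \tag{36.2-43} \]

This relation together with 36.2-35 gives

\[ \exp(z)\, F\left({\ \atop a}\,\middle|\,\frac{z^2}{4}\right) = F\left({a-\frac{1}{2} \atop 2a-1}\,\middle|\,2z\right) \tag{36.2-44} \]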


The following relations are derived from the preceding ones: 


F fa 2) F ba A = e ser en ' (36.2-45) 
Gl) rh) = [FE p) 


a, b,a+b-1 
a—i a—i NY 
F 2 F 2 |z) = ÍF E 36.2-46 
(. 2a—1 :) (. 2a—1 :) | (. 34, 3(a+1) 64 )| ( ) 


PI) ral- = [rGuuls)) w 
F(,* il) JN -i) = Ben ari =) ipaa) 
Fl, 2) Phi a :) = ; rl, ASI (36.2-48a) 
Fl, z) IHE :) T M a 12) (36.2-48b) 


36.3 Examples: elementary functions 


The ‘well-known’ functions like exp, log and sin are expressed as hypergeometric functions. In some cases 
a transformation is applied to obtain alternative series. 


36.3.1 Powers, roots, and binomial series 


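\[ \frac{1}{(1-z)^a} = F\left({a \atop \ }\,\middle|\,z\right) = \sum_{k=0}^{\infty} \binom{a+k-1}{k} z^k \tag{36.3-1a} \]
\[ (1+z)^a = F\left({-a \atop \ }\,\middle|\,-z\right) = \sum_{k=0}^{\infty} \binom{a}{k} z^k \tag{36.3-1b} \]

An important special case of relation 36.3-1a is

\[ \frac{1}{1-z} = F\left({1 \atop \ }\,\middle|\,z\right) = \sum_{k=0}^{\infty} z^k \tag{36.3-2} \]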
Further identities are


(1 — 2z) (1 — 2)"7! 


a) 


Bo m 


(1 -= vz)" -= (1+ vz)™ 
2n/z 
1 1 
z = log ave 
2/z 1— yz 
The following identities are found by dividing relations |36.3-4b] and |36.3-4a 


(1&3 -ü- 2" Num (ES 
(1 +2)? + (1- 2)” 2 


(241, z AAA 
+1) : 


| 


E 
E 


Mie NIE NÍS 


— 
y 


n#0 
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(36.3-3a) 
(36.3-3b) 


(36.3-3c) 


(36.3-4a) 


(36.3-4b) 


(36.3-4c) 


(36.3-5a) 


22 
aa} (36.3-5b) 


Set in |36.3-3b| to obtain the first of the following identities. The second, given by Michael Somos [priv. 
comm.], can be found by swapping the lower and upper parameters and replacing 4z by 2/4. The Ck 


denote the Catalan numbers, see section on page 331] 


r 3,1  1-yl-=4z — 2 
2 2z i 


36.3.2 Chebyshev polynomials 
The Chebyshev polynomials are treated in section 35.2 on page 676; we have


T,(1—2z) = E :) 
gig). as por =) 
Un(z) = A 


Relation |35.2-14 on page 679| written as T, (11 /n(2)) = z = id(z), shows that 


[-1] 1 1 
n, —n|1l—z -,—-|1—z 
2 2 2 2 


near z = 1 (here FU denotes the inverse function). 


1+vV1- 4z 


(36.3-6a) 


(36.3-6b) 


(36.3-7a) 
(36.3-7b) 


(36.3-7c) 


(36.3-8) 
H_1  = 2 z
H_2  = 4 z^2 - 2
H_3  = 8 z^3 - 12 z
H_4  = 16 z^4 - 48 z^2 + 12
H_5  = 32 z^5 - 160 z^3 + 120 z
H_6  = 64 z^6 - 480 z^4 + 720 z^2 - 120
H_7  = 128 z^7 - 1344 z^5 + 3360 z^3 - 1680 z
H_8  = 256 z^8 - 3584 z^6 + 13440 z^4 - 13440 z^2 + 1680
H_9  = 512 z^9 - 9216 z^7 + 48384 z^5 - 80640 z^3 + 30240 z
H_10 = 1024 z^10 - 23040 z^8 + 161280 z^6 - 403200 z^4 + 302400 z^2 - 30240

Figure 36.3-A: The first few Hermite polynomials.


36.3.3 Hermite polynomials
The Hermite polynomials $H_n(z)$ can be defined by the recurrence

\[ H_{n+1}(z) = 2z\, H_n(z) - 2n\, H_{n-1}(z) \tag{36.3-9} \]

where $H_0(z) = 1$ and $H_1(z) = 2z$. The first few are shown in figure 36.3-A. For nonnegative integer n we have

\[ H_n(z) = (2z)^n\, F\left({-\frac{n}{2},\,-\frac{n-1}{2} \atop \ }\,\middle|\,-\frac{1}{z^2}\right) \tag{36.3-10} \]


36.3.4 Exponential function and logarithm 


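The exponential function is the hypergeometric function with empty parameter lists:

\[ \exp(z) = F\left(\,\middle|\,z\right) = \sum_{k=0}^{\infty} \frac{z^k}{k!} \tag{36.3-11} \]

For the logarithm we have

\[ \log(1+z) = z\, F\left({1,\,1 \atop 2}\,\middle|\,-z\right) = \sum_{k=0}^{\infty} \frac{(-1)^k\, z^{k+1}}{k+1} \tag{36.3-12a} \]
\[ \log\left(\frac{1+z}{1-z}\right) = 2z\, F\left({\frac{1}{2},\,1 \atop \frac{3}{2}}\,\middle|\,z^2\right) = 2 \sum_{k=0}^{\infty} \frac{z^{2k+1}}{2k+1} \tag{36.3-12b} \]

For large arguments the following relation can be useful [1, p.68]. Set $w = \frac{z}{2a+z}$, then

\[ \log(z+a) = \log(a) + \log\frac{z+a}{a} = \log(a) + 2w \left[ 1 + \frac{w^2}{3} + \frac{w^4}{5} + \ldots \right] \tag{36.3-13a} \]
\[ = \log(a) + 2w\, F\left({\frac{1}{2},\,1 \atop \frac{3}{2}}\,\middle|\,w^2\right) \tag{36.3-13b} \]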
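The relation for log(z+a) can be used to update a known logarithm. The following is a minimal sketch, assuming that log(a) is already available and that w = z/(2a+z) with a > 0 and z > −a; the function name `log_shifted` is ours.

```cpp
#include <cassert>
#include <cmath>

// Return log(z + a), given log_a = log(a), via
//   log(z + a) = log(a) + 2 w (1 + w^2/3 + w^4/5 + ...),  w = z / (2 a + z).
double log_shifted(double log_a, double a, double z)
{
    assert(a > 0.0 && z > -a);
    const double w = z / (2.0 * a + z);   // |w| < 1 for all admissible z
    const double w2 = w * w;
    double power = 1.0, sum = 0.0;
    for (int k = 0; k < 10000; ++k)
    {
        const double t = power / (2 * k + 1);  // w^(2k) / (2k+1)
        sum += t;
        if (t < 1e-17) break;
        power *= w2;
    }
    return log_a + 2.0 * w * sum;
}
```

The series is fastest when z is small compared to a, for example when stepping from log(1000) to log(1001).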
36.3.5 Bessel functions and error function
The Bessel functions $J_n$ of the first kind and the modified Bessel functions $I_n$ (as given in [1]):

\[ J_n(z) = \frac{(z/2)^n}{n!}\, F\left({\ \atop n+1}\,\middle|\,\frac{-z^2}{4}\right) \tag{36.3-14a} \]
\[ I_n(z) = \frac{(z/2)^n}{n!}\, F\left({\ \atop n+1}\,\middle|\,\frac{z^2}{4}\right) \tag{36.3-14b} \]
F MEUS 36.3-14 
a) a icio») (36.3-14c) 
ok 
z 
F (i z) = - Yum (36.3-14d) 
k=0 
Error function (the Kummer transformation, relation 36.2-33 on page 693, gives relation 36.3-15b):
z 1 
VT erf(z) :— 1 e" dt = zF :| - 2 (36.3-15a) 
t=0 5 
x gue els "Bm E (36.3-15b) 
d al a “Dk +1) = 
1 ok 
= Ez 2 E 
eE EE (36.3-15c) 
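Both hypergeometric forms of the error function can be summed directly. The following is a minimal sketch, assuming the two standard forms (√π/2) erf(z) = z F(1/2; 3/2 | −z²) and, after the Kummer transformation, z exp(−z²) F(1; 3/2 | z²); the function names are ours.

```cpp
#include <cassert>
#include <cmath>

// erf(z) from the alternating series z * F(1/2; 3/2 | -z^2).
double erf_alternating(double z)
{
    double term = 1.0, sum = 1.0;
    for (int k = 0; k < 500; ++k)
    {
        term *= (0.5 + k) / (1.5 + k) * (-z * z) / (k + 1);
        sum += term;
        if (std::fabs(term) < 1e-17) break;
    }
    return 2.0 / std::sqrt(std::acos(-1.0)) * z * sum;
}

// erf(z) from z * exp(-z^2) * F(1; 3/2 | z^2); all terms are positive.
double erf_positive(double z)
{
    double term = 1.0, sum = 1.0;
    for (int k = 0; k < 500; ++k)
    {
        term *= z * z / (k + 1.5);  // (1)_k / (3/2)_k * z^(2k) / k!
        sum += term;
        if (term < 1e-17 * sum) break;
    }
    return 2.0 / std::sqrt(std::acos(-1.0)) * z * std::exp(-z * z) * sum;
}
```

The second form avoids the cancellation that the alternating series suffers from for larger |z|.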
36.3.6 Trigonometric and hyperbolic functions 
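Series for sine and hyperbolic sine are

\[ \sin(z) = z\, F\left({\ \atop \frac{3}{2}}\,\middle|\,\frac{-z^2}{4}\right) = \sum_{k=0}^{\infty} \frac{(-1)^k\, z^{2k+1}}{(2k+1)!} \tag{36.3-16a} \]
\[ \sinh(z) = z\, F\left({\ \atop \frac{3}{2}}\,\middle|\,\frac{z^2}{4}\right) = \sum_{k=0}^{\infty} \frac{z^{2k+1}}{(2k+1)!} \tag{36.3-16b} \]

Applying the transformation 36.2-43 on page 694 to relation 36.3-16a gives

\[ [\sin(z)]^2 = z^2\, F\left({1 \atop \frac{3}{2},\,2}\,\middle|\,-z^2\right) \tag{36.3-17} \]

For the cosine and hyperbolic cosine we have

\[ \cos(z) = F\left({\ \atop \frac{1}{2}}\,\middle|\,\frac{-z^2}{4}\right) = \sum_{k=0}^{\infty} \frac{(-1)^k\, z^{2k}}{(2k)!} \tag{36.3-18a} \]
\[ \cosh(z) = F\left({\ \atop \frac{1}{2}}\,\middle|\,\frac{z^2}{4}\right) = \sum_{k=0}^{\infty} \frac{z^{2k}}{(2k)!} \tag{36.3-18b} \]
\[ \exp(z)\, \frac{\sinh(z)}{z} = F\left({1 \atop 2}\,\middle|\,2z\right) \tag{36.3-19a} \]
\[ \exp(iz)\, \frac{\sin(z)}{z} = F\left({1 \atop 2}\,\middle|\,2iz\right) \tag{36.3-19b} \]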
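Further expressions for the sine and cosine are

\[ \frac{\sin(a z)}{a \sin(z)} = F\left({\frac{1+a}{2},\,\frac{1-a}{2} \atop \frac{3}{2}}\,\middle|\,\sin(z)^2\right) \tag{36.3-20a} \]
\[ \cos(a z) = F\left({+\frac{a}{2},\,-\frac{a}{2} \atop \frac{1}{2}}\,\middle|\,\sin(z)^2\right) \tag{36.3-20b} \]
\[ \frac{\cos(a z)}{\cos(z)} = F\left({\frac{1+a}{2},\,\frac{1-a}{2} \atop \frac{1}{2}}\,\middle|\,\sin(z)^2\right) \tag{36.3-20c} \]
\[ \frac{\sin(a z)}{a \sin(z) \cos(z)} = \frac{2 \sin(a z)}{a \sin(2z)} = F\left({1+\frac{a}{2},\,1-\frac{a}{2} \atop \frac{3}{2}}\,\middle|\,\sin(z)^2\right) \tag{36.3-20d} \]

Relations for the hyperbolic sine and cosine are obtained by replacing sin → sinh, cos → cosh, and negating the sign of the argument of the hypergeometric function. For example, relation 36.3-20b gives

\[ \cosh(a z) = F\left({+\frac{a}{2},\,-\frac{a}{2} \atop \frac{1}{2}}\,\middle|\,-\sinh(z)^2\right) \tag{36.3-20e} \]

The transformation 36.2-41 on page 693 (with a = 1/2 and a = 3/2, respectively) gives

\[ \cos(z) \cosh(z) = F\left({\ \atop \frac{1}{4},\,\frac{1}{2},\,\frac{3}{4}}\,\middle|\,\frac{-z^4}{64}\right) = \sum_{n=0}^{\infty} \frac{(-1)^n\, z^{4n}}{(4n)!/4^n} \tag{36.3-21a} \]
\[ = 1 - \frac{1}{6} z^4 + \frac{1}{2520} z^8 - \frac{1}{7484400} z^{12} \pm \ldots \tag{36.3-21b} \]
\[ \sin(z) \sinh(z) = z^2\, F\left({\ \atop \frac{3}{4},\,\frac{5}{4},\,\frac{3}{2}}\,\middle|\,\frac{-z^4}{64}\right) = \sum_{n=0}^{\infty} \frac{(-1)^n\, z^{4n+2}}{(4n+2)!/(2 \cdot 4^n)} \tag{36.3-21c} \]
\[ = z^2 - \frac{1}{90} z^6 + \frac{1}{113400} z^{10} - \frac{1}{681080400} z^{14} \pm \ldots \tag{36.3-21d} \]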
36.3.7 Inverse trigonometric and hyperbolic functions 
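Series for the inverse tangent and cotangent:

\[ \arctan(z) = \frac{1}{2i} \log\frac{1+iz}{1-iz} = \mathrm{Im} \log(1+iz) \tag{36.3-22a} \]
\[ = z\, F\left({\frac{1}{2},\,1 \atop \frac{3}{2}}\,\middle|\,-z^2\right) = \sum_{k=0}^{\infty} \frac{(-1)^k\, z^{2k+1}}{2k+1} \tag{36.3-22b} \]

Pfaff's reflection (relation 36.2-7b) leads to

\[ \arctan(z) = \frac{z}{\sqrt{1+z^2}}\, F\left({\frac{1}{2},\,\frac{1}{2} \atop \frac{3}{2}}\,\middle|\,\frac{z^2}{1+z^2}\right) = \arccos\frac{1}{\sqrt{1+z^2}} \tag{36.3-22c} \]
\[ = \frac{z}{1+z^2}\, F\left({1,\,1 \atop \frac{3}{2}}\,\middle|\,\frac{z^2}{1+z^2}\right) \quad \text{by 36.2-7b} \tag{36.3-22d} \]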
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1 1 
arctanh(z) = z 183 = 
1 4 Co z2k+1 
= zF|^?. | 2| = 5 
3 
| 2 k=0 2k +1 
1, d 2 1 
aia): cem 2. Qk+ 1) 2H 
—1 1 
log(z) = 2 arctanh = 2 arccoth ae 
z+1 z—1 
1 
arccot(z) = arctan ( ) ELS log En 
2 2 z—i 
= lp 2» 1 = A pa 3 ED 
z 3 z? ui (2k +1) 22*41 
1 Li 1 1 
= —— F|? d = arcsin 
V1+ 22 | 3 1+ 2? 14 22 
2 1,1 1 
= — F | 
1+ 2? ( 3 [14 z) 
Series for the inverse sine and cosine are 
1L 
arcsin(z) = 2F|?,? 27] = arctan g 
2 — Y 


1,1 
ry) 
2 


E 1- /1-2 
zF | 


: by [36.2-12b 


1,1 
z ar (>, |) 
2 


The two latter relations suggest the following argument reduction applicable for the inverse sine (and 
tangent). Let G(z) = (1 — V1 — z)/2, then 


1 


crea) 
1 


Vi-z J1- GG) 


AS 
Eni 
ho 7 
p 
UA 
Il 
[I e a 
— 
| 
& 
> 
LLL 
J 
ins 
N 


arccos(z) 


T . 
——arcsin(z) = arccot 
2 -z 




(36.3-23a) 


(36.3-23b) 


(36.3-23c) 


(36.3-23d) 


(36.3-24a) 


(36.3-24b) 


(36.3-24c) 


(36.3-24d) 


(36.3-25a) 


(36.3-25b) 


(36.3-25c) 


(36.3-26a) 


(36.3-26b) 


(36.3-26c) 


(36.3-27) 
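For the inverse hyperbolic sine we have

\[ \operatorname{arcsinh}(z) = \log\left(z + \sqrt{1+z^2}\right) = z\, F\left({\frac{1}{2},\,\frac{1}{2} \atop \frac{3}{2}}\,\middle|\,-z^2\right) \tag{36.3-28a} \]
\[ = \frac{z}{\sqrt{1+z^2}}\, F\left({\frac{1}{2},\,1 \atop \frac{3}{2}}\,\middle|\,\frac{z^2}{1+z^2}\right) \quad \text{by 36.2-7b} \tag{36.3-28b} \]
\[ = z\, F\left({1,\,1 \atop \frac{3}{2}}\,\middle|\,\frac{1-\sqrt{1+z^2}}{2}\right) \quad \text{by 36.2-12b} \tag{36.3-28c} \]

The following relations follow from Clausen's product formula (relation 36.2-24 on page 691):

\[ [\arctan(z)]^2 = \frac{z^2}{1+z^2}\, F\left({1,\,1,\,1 \atop \frac{3}{2},\,2}\,\middle|\,\frac{z^2}{1+z^2}\right) \tag{36.3-29a} \]
\[ [\operatorname{arccot}(z)]^2 = \frac{1}{1+z^2}\, F\left({1,\,1,\,1 \atop \frac{3}{2},\,2}\,\middle|\,\frac{1}{1+z^2}\right) \tag{36.3-29b} \]
\[ [\arcsin(z)]^2 = z^2\, F\left({1,\,1,\,1 \atop \frac{3}{2},\,2}\,\middle|\,z^2\right) \tag{36.3-29c} \]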
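The series for the squared inverse sine can be verified with a few lines of code. The following is a minimal sketch, assuming the form [arcsin(z)]² = z² F(1,1,1; 3/2, 2 | z²) and |z| < 1; the function name is ours.

```cpp
#include <cassert>
#include <cmath>

// [arcsin(z)]^2 = z^2 * F(1,1,1; 3/2, 2 | z^2) for |z| < 1.
double arcsin_squared(double z)
{
    assert(std::fabs(z) < 1.0);
    double term = 1.0, sum = 1.0;
    for (int k = 0; k < 100000; ++k)
    {
        // ratio of consecutive terms: (k+1)^2 / ((k+3/2)(k+2)) * z^2
        term *= (k + 1.0) * (k + 1.0) / ((k + 1.5) * (k + 2.0)) * z * z;
        sum += term;
        if (term < 1e-17 * sum) break;
    }
    return z * z * sum;
}
```

All terms are positive, so there is no cancellation; convergence slows down as |z| approaches 1.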


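36.4 Transformations for elliptic integrals ‡

We give relations between elliptic integrals defined in section 31.2 on page 600 in terms of hypergeometric functions. To avoid the factor $\frac{\pi}{2}$, define $\tilde{K} := \frac{2}{\pi} K$ and $\tilde{E} := \frac{2}{\pi} E$. Then we have

\[ \tilde{K}(k) = F\left({\frac{1}{2},\,\frac{1}{2} \atop 1}\,\middle|\,k^2\right), \qquad \tilde{E}(k) = F\left({-\frac{1}{2},\,\frac{1}{2} \atop 1}\,\middle|\,k^2\right) \tag{36.4-1} \]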
The product form given as relation on page is the transformation 
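\[ \tilde{K}(k) = \frac{2}{1+k'}\, F\left({\frac{1}{2},\,\frac{1}{2} \atop 1}\,\middle|\,\left(\frac{1-k'}{1+k'}\right)^2\right) \tag{36.4-2a} \]

The relation can be written as

\[ \tilde{K}(k) = (1+z(k))\, \tilde{K}(z(k)) \quad \text{where} \quad z(k) := \frac{1-k'}{1+k'} \tag{36.4-2b} \]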
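Iterating the last relation gives a quadratically convergent product for the normalized elliptic integral. The following is a minimal sketch, assuming 0 ≤ k < 1 and k' = √(1−k²); the names `K_tilde` and `K_tilde_series` are ours, and the series is included only for comparison.

```cpp
#include <cassert>
#include <cmath>

// (2/pi) K(k) by repeated use of K~(k) = (1 + z) K~(z), z = (1-k')/(1+k').
double K_tilde(double k)
{
    assert(k >= 0.0 && k < 1.0);
    double prod = 1.0;
    while (k > 1e-9)
    {
        const double kp = std::sqrt((1.0 - k) * (1.0 + k));  // k'
        const double r = k / (1.0 + kp);
        const double z = r * r;  // equals (1-k')/(1+k'), without cancellation
        prod *= 1.0 + z;
        k = z;
    }
    return prod;  // K~(k) = 1 + O(k^2) for the final tiny k
}

// Direct summation of F(1/2, 1/2; 1 | k^2), slow for k near 1.
double K_tilde_series(double k)
{
    assert(k >= 0.0 && k < 1.0);
    double term = 1.0, sum = 1.0;
    for (int j = 0; j < 1000000; ++j)
    {
        const double q = (j + 0.5) / (j + 1.0);
        term *= q * q * k * k;
        sum += term;
        if (term < 1e-17 * sum) break;
    }
    return sum;
}
```

About five or six iterations suffice for any k that is not extremely close to 1.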


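Relation 36.2-20f on page 691 with a = 1/2 gives

\[ \tilde{K}(k) = \frac{1}{1+k}\, \tilde{K}\left(\frac{2\sqrt{k}}{1+k}\right) \tag{36.4-3} \]

From relations 36.2-12a on page 690 and 36.2-30 we find

\[ \tilde{K}(k) = F\left({\frac{1}{4},\,\frac{1}{4} \atop 1}\,\middle|\,(2kk')^2\right) = \left[ F\left({\frac{1}{2},\,\frac{1}{2},\,\frac{1}{2} \atop 1,\,1}\,\middle|\,(2kk')^2\right) \right]^{1/2} \tag{36.4-4} \]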


'The following transformation is due to Kummer rel.13, p.129]: 


CB) GBEE) 


(36.4-5) 


4a+3 2a+5 
ans 2 25 [41-4 
Setting a = 1/2 gives


2 
E 4 5 1— V1—k? 
K(k) = d : (36.4-6) 
(1+ V1—k?) 1441-4 
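We further set

\[ \tilde{N}(k) = F\left({-\frac{1}{2},\,-\frac{1}{2} \atop 1}\,\middle|\,k^2\right) \tag{36.4-7} \]
\[ = 1 + \frac{1}{4} k^2 + \frac{1}{64} k^4 + \frac{1}{256} k^6 + \frac{25}{16384} k^8 + \frac{49}{65536} k^{10} + \frac{441}{1048576} k^{12} + \ldots \tag{36.4-8} \]

A special value is $\tilde{N}(1) = 4/\pi$. We have

\[ \tilde{E}(k) = \frac{1+\sqrt{1-k^2}}{2}\, F\left({-\frac{1}{2},\,-\frac{1}{2} \atop 1}\,\middle|\,\left(\frac{1-\sqrt{1-k^2}}{1+\sqrt{1-k^2}}\right)^2\right) \tag{36.4-9a} \]
\[ = \frac{1+k'}{2}\, F\left({-\frac{1}{2},\,-\frac{1}{2} \atop 1}\,\middle|\,\left(\frac{1-k'}{1+k'}\right)^2\right) \tag{36.4-9b} \]
\[ = (1+z(k))^{-1}\, \tilde{N}(z(k)) \quad \text{where} \quad z(k) := \frac{1-k'}{1+k'} \tag{36.4-9c} \]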
Compare the last equality to relation |36.4-2b} Relations |36.2-20e| on page and [36.2-7b| on page 
give 
Y Ji pa. =P 
Nk) = vVi-RrF[ ++ AS 36.4-10a 
(k) ‘ace. (36.4-102) 
-1 8 2k V? 
= yr| * i = (=) (36.4-10b) 
1 k! 
J ie 2k V? 
= ug vi Aj 
ES i ( n z) (36.4-10c) 


More such transformations are given in [222] pp.145-148]. 


Applying the transformation 36.2-20f on page 691 on the defining relation for $\tilde{N}(k)$ gives the key to fast computation of this function:

\[ \tilde{N}(k) = (1+k)\, \tilde{E}\left(\frac{2\sqrt{k}}{1+k}\right) \tag{36.4-11} \]

The relation

\[ 2\tilde{E}(k) - k'^2\, \tilde{K}(k) = \tilde{N}(k) \tag{36.4-12} \]

can be used to rewrite Legendre's relation (equation 31.2-23 on page 604) as either of the following (set $N := \frac{\pi}{2} \tilde{N}$ in the second identity):

\[ \frac{4}{\pi} = \tilde{N} \tilde{K}' + \tilde{K} \tilde{N}' - \tilde{K} \tilde{K}', \qquad \pi = N K' + K N' - K K' \tag{36.4-13} \]
We note the following generalization of Legendre's relation given in [139, vol.1, p.85] (and [17, p.138]):


rl. + PEL PTS _ (36.4-14) 
A A 
+ p we :) rl pec t-z) - 
UL ) dp T-2) 


Setting a — b — c — 0 gives Legendre's relation. Setting c — a and b — —a gives [66] rel.5.5.6, p.178]: 


1 2 cos(a rr) 
2 = 36.4-15 
TEFOTE=D - 1+2 em) 
Jp gs 1 pl 1 
+ are J) 7(*5 m DE 
1 1 
pido. 1_ 1 1 
+r ( sa; +53—a :) F( 5—4, +5 +0 1-2) - 
1 1 
Ll 1 
-r ( 370 +30 pra % t2 “|1-2) 
1 1 
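36.5 The function x^x ‡
Boldly setting a = 1 + z in $(1+z)^a = F\left({-a \atop \ }\,\middle|\,-z\right)$ (relation 36.3-1b on page 694) gives

\[ (1+z)^{(1+z)} = F\left({-1-z \atop \ }\,\middle|\,-z\right) = \exp\left[(1+z) \log(1+z)\right] \tag{36.5-1a} \]
\[ = 1 + \frac{(z+1)}{1} z \left[ 1 + \frac{(z+0)}{2} z \left[ 1 + \frac{(z-1)}{3} z \left[ 1 + \frac{(z-2)}{4} z \left[ \ldots \right] \right] \right] \right] \tag{36.5-1b} \]
\[ = 1 + z + z^2 + \frac{1}{2} z^3 + \frac{1}{3} z^4 + \frac{1}{12} z^5 + \frac{3}{40} z^6 - \frac{1}{120} z^7 + \frac{59}{2520} z^8 - \frac{71}{5040} z^9 \pm \ldots \tag{36.5-1c} \]

This somewhat surprising expression allows the computation of $x^x$ without computing exp() or log(). The series converges for real z > 0 so we can compute $x^x$ (where x = 1 + z) for real x > +1 as

\[ x^x = F\left({-x \atop \ }\,\middle|\,-x+1\right) \tag{36.5-2a} \]
\[ = 1 + \frac{x}{1} (x-1) \left[ 1 + \frac{x-1}{2} (x-1) \left[ 1 + \frac{x-2}{3} (x-1) \left[ 1 + \ldots \right] \right] \right] \tag{36.5-2b} \]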


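We denote the series obtained by truncating after the n-th term of the hypergeometric series by $g_n(x)$. For example, with n = 2 and n = 4 we find

\[ g_2(x) = \frac{1}{2} x^4 - \frac{3}{2} x^3 + \frac{5}{2} x^2 - \frac{3}{2} x + 1 \tag{36.5-3a} \]
\[ = \frac{1}{2} z^4 + \frac{1}{2} z^3 + z^2 + z + 1 \tag{36.5-3b} \]
\[ g_4(x) = \frac{1}{24} x^8 - \frac{5}{12} x^7 + \frac{15}{8} x^6 - \frac{19}{4} x^5 + \frac{61}{8} x^4 - \frac{31}{4} x^3 + \frac{131}{24} x^2 - \frac{25}{12} x + 1 \tag{36.5-3c} \]
\[ = \frac{1}{24} z^8 - \frac{1}{12} z^7 + \frac{1}{8} z^6 + \frac{1}{12} z^5 + \frac{1}{3} z^4 + \frac{1}{2} z^3 + z^2 + z + 1 \tag{36.5-3d} \]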
We have $g_n(n) = n^n$ and further $g_n(k) = k^k$ for all integer k ≤ n. Thus we have just invented a curious
way to find polynomials of degree 2n that interpolate $k^k$ for 0 ≤ k ≤ n (setting $0^0 := 1$ for our purposes).
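The truncated series is easily evaluated term by term. The following is a minimal sketch, assuming that the truncation after the n-th term is the sum of binomial(x,k) (x−1)^k for k = 0, ..., n; the function name `g` mirrors the notation of the text but the routine itself is ours.

```cpp
#include <cassert>
#include <cmath>

// g_n(x) = sum_{k=0}^{n} binomial(x, k) * (x-1)^k,
// the series for x^x truncated after the n-th term.
double g(int n, double x)
{
    assert(n >= 0);
    const double z = x - 1.0;
    double term = 1.0, sum = 1.0;
    for (int k = 0; k < n; ++k)
    {
        // binomial(x, k+1) z^(k+1) = binomial(x, k) z^k * (x-k) z / (k+1)
        term = term * (x - k) * z / (k + 1);
        sum += term;
    }
    return sum;
}
```

For integer arguments k ≤ n the sum terminates and reproduces k^k, for example `g(4, 3.0)` gives 27; for x = 1.5 the values `g(n, 1.5)` approach 1.5^1.5 as n grows.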


The polynomials actually give acceptable estimates for $x^x$ also for non-integer x, especially for x near 1.
The (unique) degree-n polynomials $i_n(x)$ that are obtained by interpolating the values $k^k$ have much
greater coefficients and give values far away from $x^x$ for non-integer arguments x.

For 0 < x < n the interpolating polynomials $i_n(x)$ give an estimate that is consistently worse than $g_n(x)$
for non-integer values of x. The same is true even for the polynomials $i_{2n}(x)$ that interpolate $k^k$ for
0 ≤ k ≤ 2n (so that $\deg(i_{2n}) = \deg(g_n) = 2n$). In fact, the $i_{2n}(x)$ approximate consistently worse than
$i_n(x)$ for non-integer x.


Finally, the Padé approximants Pin n] (x) for gn(x) give estimates that are worse than with both i, (x) or 
gn(x). Further, g, (x) 4 x” even for integer x and the Pj, ,j(r) have a pole on the real axis near x = 1. 
That is, we found a surprisingly good and compact polynomial approximation for the function x”. 


The sequence of the n-th derivatives of $(1+z)^{(1+z)}$ at z = 0 is entry A005727 in [312]:


? Vec(serlaplace(exp((1+z)*log(1+z))))
[1, 1, 2, 3, 8, 10, 54, -42, 944, -5112, 47160, -419760, 4297512, ... ] 


Many other expressions for the function x? can be given, we note just one: set n = 2+z in relation|36.3-3al 


on page 695|to obtain 


if =2-2,2+3 
Q2) — | E a 
(1-4 z) F ( z242 :) (36.5-4) 
Chapter 37


Cyclotomic polynomials, product 
forms, and continued fractions 


We describe the cyclotomic polynomials and some of their properties, together with the Möbius inversion
principle. We also give algorithms to convert power series into Lambert series and infinite products.
Continued fractions are described together with algorithms for their computation.


37.1 Cyclotomic polynomials, Möbius inversion, Lambert series


37.1.1 Cyclotomic polynomials 
The roots (over C) of the polynomial $x^n - 1$ are the n-th roots of unity:

\[ x^n - 1 = \prod_{k=0}^{n-1} \left( x - \exp\left(\frac{2 \pi i k}{n}\right) \right) \tag{37.1-1} \]

The n-th cyclotomic polynomial $Y_n$ can be defined as the (monic) polynomial whose roots are the primitive
n-th roots of unity:

\[ Y_n(x) = \prod_{\substack{k=0 \ldots n-1 \\ \gcd(k,n)=1}} \left( x - \exp\left(\frac{2 \pi i k}{n}\right) \right) \tag{37.1-2} \]

Most sources use $\Phi_n(x)$ for the cyclotomic polynomials, we use Y because Φ is overused. The degree of
$Y_n$ equals the number of primitive n-th roots, that is

\[ \deg(Y_n) = \varphi(n) \tag{37.1-3} \]


The coefficients are integers, for example, 
Yesla) = Bart (37.1-4) 


The first 30 cyclotomic polynomials are shown in figure 37.1-A. The first cyclotomic polynomial with a
coefficient not in the set {0, ±1} is $Y_{105}$:

\[ Y_{105}(x) = x^{48} + x^{47} + x^{46} - x^{43} - x^{42} - 2 x^{41} - x^{40} - x^{39} \pm \ldots \tag{37.1-5} \]

The cyclotomic polynomials are irreducible over Z. All except $Y_1$ are self-reciprocal.

For n prime the cyclotomic polynomial $Y_n(x)$ equals $(x^n - 1)/(x - 1) = x^{n-1} + x^{n-2} + \ldots + x + 1$. For
n = 2k and odd k ≥ 3 we have $Y_n(x) = Y_k(-x)$. For n = pk where p is a prime that does not divide k
we have $Y_n(x) = Y_k(x^p)/Y_k(x)$. The following algorithm for the computation of $Y_n(x)$ is given in [154,
p.403]:
 n:  Yn(x)
 1:  x - 1
 2:  x + 1
 3:  x^2 + x + 1
 4:  x^2 + 1
 5:  x^4 + x^3 + x^2 + x + 1
 6:  x^2 - x + 1
 7:  x^6 + x^5 + x^4 + x^3 + x^2 + x + 1
 8:  x^4 + 1
 9:  x^6 + x^3 + 1
10:  x^4 - x^3 + x^2 - x + 1
11:  x^10 + ... + 1   <--= all coefficients are one for prime n
12:  x^4 - x^2 + 1
13:  x^12 + ... + 1
14:  x^6 - x^5 + x^4 - x^3 + x^2 - x + 1
15:  x^8 - x^7 + x^5 - x^4 + x^3 - x + 1
16:  x^8 + 1
17:  x^16 + ... + 1
18:  x^6 - x^3 + 1
19:  x^18 + ... + 1
20:  x^8 - x^6 + x^4 - x^2 + 1
21:  x^12 - x^11 + x^9 - x^8 + x^6 - x^4 + x^3 - x + 1
22:  x^10 - x^9 + x^8 - x^7 + x^6 - x^5 + x^4 - x^3 + x^2 - x + 1
23:  x^22 + ... + 1
24:  x^8 - x^4 + 1
25:  x^20 + x^15 + x^10 + x^5 + 1
26:  x^12 - x^11 + x^10 - x^9 + x^8 - x^7 + x^6 - x^5 + x^4 - x^3 + x^2 - x + 1
27:  x^18 + x^9 + 1
28:  x^12 - x^10 + x^8 - x^6 + x^4 - x^2 + 1
29:  x^28 + ... + 1
30:  x^8 + x^7 - x^5 - x^4 - x^3 + x + 1

Figure 37.1-A: The first 30 cyclotomic polynomials.


1. Let $[p_1, p_2, \ldots, p_r]$ be the distinct prime divisors of n. Set $y_0(x) = x - 1$.
2. For j = 1, 2, ..., r set $y_j(x) = y_{j-1}(x^{p_j})/y_{j-1}(x)$ (the division is exact).
3. Return $y_r(x^{n/(p_1 p_2 \cdots p_r)})$.

The last statement uses the fact that for n = k t where all prime factors of k divide t we have $Y_n(x) = Y_t(x^k)$. An implementation is

polcyclo2(n, z='x)=
{
    local(fc, y);
    fc = factor(n)[,1];  \\ prime divisors
    y = z-1;
    for (j=1, #fc, y=subst(y,z,z^fc[j])\y; n\=fc[j]; );
    y = subst(y, z, z^n);
    return( y );
}


Note that the routine will only work when the argument z is a symbol. 
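The three steps of the algorithm translate directly to integer coefficient vectors. The following is a minimal sketch, assuming coefficients that fit into a `long long` and using trial division to find the prime divisors; the type `Poly` and the function names are ours and not part of the accompanying software.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Poly = std::vector<long long>;  // p[i] is the coefficient of x^i

// Return p(x^e).
Poly subst_power(const Poly &p, unsigned e)
{
    Poly r((p.size() - 1) * e + 1, 0);
    for (std::size_t i = 0; i < p.size(); ++i) r[i * e] = p[i];
    return r;
}

// Return a / b for monic b, assuming that the division is exact.
Poly exact_div(Poly a, const Poly &b)
{
    assert(!b.empty() && b.back() == 1 && a.size() >= b.size());
    Poly q(a.size() - b.size() + 1, 0);
    for (std::size_t i = q.size(); i-- > 0; )
    {
        q[i] = a[i + b.size() - 1];
        for (std::size_t j = 0; j < b.size(); ++j) a[i + j] -= q[i] * b[j];
    }
    return q;
}

// n-th cyclotomic polynomial via y_j(x) = y_{j-1}(x^p_j) / y_{j-1}(x).
Poly cyclotomic(unsigned n)
{
    assert(n >= 1);
    Poly y = {-1, 1};  // y_0(x) = x - 1
    unsigned m = n, radical = 1;
    for (unsigned p = 2; p <= m; ++p)
    {
        if (m % p != 0) continue;      // p is prime whenever it divides m here
        while (m % p == 0) m /= p;
        y = exact_div(subst_power(y, p), y);
        radical *= p;
    }
    return subst_power(y, n / radical);  // final substitution x -> x^(n/(p1 p2 ... pr))
}
```

For example, `cyclotomic(12)` returns the coefficients of x^4 − x^2 + 1, lowest power first, and `cyclotomic(105)` contains the coefficient −2.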


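37.1.2 The Möbius inversion principle
The Möbius function μ(n) is defined for positive integer arguments n as

\[ \mu(n) := \begin{cases} 0 & \text{if } n \text{ has a square factor} \\ (-1)^k & \text{if } n \text{ is a product of } k \text{ distinct primes} \\ +1 & \text{if } n = 1 \end{cases} \tag{37.1-6} \]

The function satisfies

\[ \sum_{d \backslash n} \mu(d) = \begin{cases} 1 & \text{if } n = 1 \\ 0 & \text{otherwise} \end{cases} \tag{37.1-7} \]

The Möbius function can be expressed as a sum of the primitive n-th roots of unity [139, vol.3, p.173]:

\[ \mu(n) = \sum_{\substack{k=0 \ldots n-1 \\ \gcd(k,n)=1}} \exp\left(2 \pi i\, k/n\right) \tag{37.1-8} \]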
 n: μ(n)    n: μ(n)    n: μ(n)    n: μ(n)    n: μ(n)    n: μ(n)    n: μ(n)    n: μ(n)
 1: +1     11: -1     21: +1     31: -1     41: -1     51: +1     61: -1     71: -1
 2: -1     12:  0     22: +1     32:  0     42: -1     52:  0     62: +1     72:  0
 3: -1     13: -1     23: -1     33: +1     43: -1     53: -1     63:  0     73: -1
 4:  0     14: +1     24:  0     34: +1     44:  0     54:  0     64:  0     74: +1
 5: -1     15: +1     25:  0     35: +1     45:  0     55: +1     65: +1     75:  0
 6: +1     16:  0     26: +1     36:  0     46: +1     56:  0     66: -1     76:  0
 7: -1     17: -1     27:  0     37: -1     47: -1     57: +1     67: -1     77: +1
 8:  0     18:  0     28:  0     38: +1     48:  0     58: +1     68:  0     78: -1
 9:  0     19: -1     29: -1     39: +1     49:  0     59: -1     69: +1     79: -1
10: +1     20:  0     30: -1     40:  0     50:  0     60:  0     70: -1     80:  0

Figure 37.1-B: Values of the Möbius function μ(n) for n ≤ 80.


The sequence of the values of the Möbius function (see figure 37.1-B) is entry A008683 in [312].

A function f(n) is called multiplicative if f(1) = 1 and f satisfies

\[ f(n \cdot m) = f(n) \cdot f(m) \quad \text{if } \gcd(n, m) = 1 \tag{37.1-9} \]

For a multiplicative function one always has $f(n) = f(p_1^{e_1}) \cdot f(p_2^{e_2}) \cdots f(p_k^{e_k})$ where $n = p_1^{e_1} \cdot p_2^{e_2} \cdots p_k^{e_k}$
is the factorization of n into prime powers $p_i^{e_i}$. If the equality holds also for gcd(n, m) ≠ 1 the function
is called completely multiplicative. Such a function satisfies $f(n) = f(p_1)^{e_1} \cdot f(p_2)^{e_2} \cdots f(p_k)^{e_k}$. For a
multiplicative function f we have (subject to convergence, see vol.3, p.169])

\[ \sum_{n=1}^{\infty} f(n) = \prod_{p} \left[ 1 + f(p) + f(p^2) + f(p^3) + \ldots \right] \tag{37.1-10} \]

where the product on the right side is over all primes. If f is completely multiplicative, then $f(p^k) = f(p)^k$ and

\[ \sum_{n=1}^{\infty} f(n) = \prod_{p} \frac{1}{1 - f(p)} \tag{37.1-11} \]

The Möbius function is multiplicative:

\[ \mu(n)\, \mu(m) = \begin{cases} \mu(n \cdot m) & \text{if } \gcd(n, m) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{37.1-12} \]


We now state the multiplicative version of the Möbius inversion principle:

\[ g(n) = \prod_{d \backslash n} f(d) \iff f(n) = \prod_{d \backslash n} g(d)^{\mu(n/d)} \tag{37.1-13} \]

For example, for the cyclotomic polynomials we have

\[ x^n - 1 = \prod_{d \backslash n} Y_d(x) \tag{37.1-14} \]

Möbius inversion gives

\[ Y_n(x) = \prod_{d \backslash n} \left( x^d - 1 \right)^{\mu(n/d)} \tag{37.1-15} \]

The relation implies a reasonably efficient algorithm for the computation of the cyclotomic polynomials.
The method also works when the argument x is not a symbol (x must be different from the n-th roots of
unity).

Relation 37.1-14 implies (considering the polynomial degrees only and using relation 37.1-3)

\[ n = \sum_{d \backslash n} \varphi(d) \tag{37.1-16} \]

while relation 37.1-15 corresponds to the equality

\[ \varphi(n) = \sum_{d \backslash n} d\, \mu(n/d) \tag{37.1-17} \]
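The defining property of the Möbius function and the last equality are easily checked for small n. The following is a minimal sketch, assuming trial division is fast enough for the arguments of interest; all function names are ours.

```cpp
#include <cassert>

// Moebius function by trial division.
int moebius(unsigned long n)
{
    assert(n >= 1);
    int mu = 1;
    for (unsigned long p = 2; p * p <= n; ++p)
    {
        if (n % p != 0) continue;
        n /= p;
        if (n % p == 0) return 0;  // square factor
        mu = -mu;
    }
    if (n > 1) mu = -mu;  // one remaining prime factor
    return mu;
}

// sum_{d\n} mu(d): equals 1 for n = 1 and 0 otherwise.
int moebius_divisor_sum(unsigned long n)
{
    int s = 0;
    for (unsigned long d = 1; d <= n; ++d)
        if (n % d == 0) s += moebius(d);
    return s;
}

// phi(n) computed as sum_{d\n} d * mu(n/d).
long totient_via_moebius(unsigned long n)
{
    long s = 0;
    for (unsigned long d = 1; d <= n; ++d)
        if (n % d == 0) s += static_cast<long>(d) * moebius(n / d);
    return s;
}
```

For example, `totient_via_moebius(105)` gives 48, the degree of the cyclotomic polynomial for n = 105.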


Relations 37.1-16 and 37.1-17 are a special case of the additive version of the Möbius inversion principle:

\[ g(n) = \sum_{d \backslash n} f(d) \iff f(n) = \sum_{d \backslash n} g(d)\, \mu(n/d) \tag{37.1-18} \]

More generally, if h is completely multiplicative (see [123, vol.1, p.447]), then

\[ g(n) = \sum_{d \backslash n} f(d)\, h(n/d) \iff f(n) = \sum_{d \backslash n} g(d)\, h(n/d)\, \mu(n/d) \tag{37.1-19} \]

Setting h(n) = 1 gives relation 37.1-18.


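We note two relations that are valid for multiplicative functions f (see [285, sect.4.6.4, facts.4+5]):

\[ \sum_{d \backslash n} \mu(d)\, f(d) = \prod_{d \backslash n,\ d \text{ prime}} \left( 1 - f(d) \right) \tag{37.1-20a} \]
\[ \sum_{d \backslash n} \mu(d)^2\, f(d) = \prod_{d \backslash n,\ d \text{ prime}} \left( 1 + f(d) \right) \tag{37.1-20b} \]

Relation 37.1-20a with f(n) = 1/n gives relation 37.1-16 and also

\[ \varphi(n) = n \prod_{d \backslash n,\ d \text{ prime}} \left( 1 - \frac{1}{d} \right) \tag{37.1-21} \]

We give two more inversion principles, taken from [176, p.237, thm.268-270]. For x > 0

\[ g(x) = \sum_{1 \le n \le x} f(x/n) \iff f(x) = \sum_{1 \le n \le x} \mu(n)\, g(x/n) \tag{37.1-22} \]
\[ g(x) = \sum_{k=1}^{\infty} f(k x) \iff f(x) = \sum_{k=1}^{\infty} g(k x)\, \mu(k) \tag{37.1-23} \]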


A much more general version of the Móbius inversion principle is described in [86], see also : 


37.1.3 Lambert series 


A Lambert series is an expansion of the form

\[ L(x) = \sum_{k>0} \frac{a_k\, x^k}{1 - x^k} = \sum_{k>0} \sum_{j>0} a_k\, x^{k j} \tag{37.1-24} \]

It can be converted to a power series

\[ L(x) = \sum_{k>0} b_k\, x^k \quad \text{where} \quad b_k = \sum_{d \backslash k} a_d \tag{37.1-25} \]

The inversion principle can be used to transform a power series to a Lambert series:

\[ a_k = \sum_{d \backslash k} b_d\, \mu(k/d) \tag{37.1-26} \]
The conversion can be implemented as 
ser2lambert(t)=
{ /* Let t=[a1,a2,a3, ...], n=length(t), where t(x)=sum_{k=1}^{n}{a_k*x^k};
   * Return L=[l1,l2,l3,...] so that (up to order n)
   * t(x)=\sum_{j=1}^{n}{l_j*x^j/(1-x^j)}
   */
    local(n, L);
    n = length(t);
    L = vector(n);
    for (k=1, n, fordiv(k, d, L[k]+=moebius(k/d)*t[d]); );
    return( L );
}
The conversion in the other direction is
lambert2ser(L)=
{ /* inverse of ser2lambert() */
    local(n, t);
    n = length(L);
    t = sum(k=1, length(L), O('x^(n+1))+L[k]*'x^k/(1-'x^k) );
    t = Vec(t);
    return( t );
}
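The same two conversions can be written for plain coefficient arrays. The following is a minimal sketch, assuming integer coefficients and 0-based arrays (entry k−1 holds the coefficient with index k); the C++ names mirror the GP routines but are otherwise ours.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Moebius function by trial division.
int moebius(unsigned long n)
{
    assert(n >= 1);
    int mu = 1;
    for (unsigned long p = 2; p * p <= n; ++p)
    {
        if (n % p != 0) continue;
        n /= p;
        if (n % p == 0) return 0;
        mu = -mu;
    }
    if (n > 1) mu = -mu;
    return mu;
}

// Lambert series coefficients a_k -> power series coefficients b_k = sum_{d\k} a_d.
std::vector<long> lambert2ser(const std::vector<long> &a)
{
    const std::size_t n = a.size();
    std::vector<long> b(n, 0);
    for (std::size_t d = 1; d <= n; ++d)
        for (std::size_t k = d; k <= n; k += d) b[k - 1] += a[d - 1];
    return b;
}

// Power series coefficients b_k -> Lambert series coefficients
// a_k = sum_{d\k} b_d * mu(k/d).
std::vector<long> ser2lambert(const std::vector<long> &b)
{
    const std::size_t n = b.size();
    std::vector<long> a(n, 0);
    for (std::size_t d = 1; d <= n; ++d)
        for (std::size_t k = d; k <= n; k += d) a[k - 1] += moebius(k / d) * b[d - 1];
    return a;
}
```

With all Lambert coefficients equal to one, `lambert2ser` returns the numbers of divisors 1, 2, 2, 3, 2, 4, ...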


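For the Lambert series with $a_k = 1$ for all k we have [109, p.95]

\[ \sum_{k>0} \frac{x^k}{1 - x^k} = \sum_{k>0} d(k)\, x^k = \sum_{k>0} \sum_{j>0} x^{k j} = \sum_{k>0} \frac{1 + x^k}{1 - x^k}\, x^{k^2} \tag{37.1-27} \]

where d(k) is the number of the divisors of k, entry A000005 in [312]. More generally, we have

\[ \sum_{k>0} \sigma_e(k)\, x^k = \sum_{k>0} \frac{k^e\, x^k}{1 - x^k} \tag{37.1-28} \]

where $\sigma_e(n)$ is the sum of the e-th powers of the divisors of n. Let o(n) denote the number of odd
divisors of n (sequence A001227 in [312]), then

\[ \sum_{k>0} o(k)\, x^k = \sum_{k>0} \frac{x^{2k-1}}{1 - x^{2k-1}} = \sum_{k>0} \frac{x^k}{1 - x^{2k}} = \sum_{k>0} \frac{x^{k(k+1)/2}}{1 - x^k} \tag{37.1-29} \]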


The first of the following relations is given in [214] p.644, ex.27]: 


p = Y ks TJ (1-%)| => 1- [UG 727) (37.1-30a) 


k>0 k>0 j>k+1 k>0 j>k 


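For the Lambert series with $a_k = \mu(k)$ we have

\[ x = \sum_{k>0} \frac{\mu(k)\, x^k}{1 - x^k} \tag{37.1-31} \]

For $a_k = \alpha^k$ we have

\[ \sum_{k>0} \frac{\alpha^k\, x^k}{1 - x^k} = \sum_{k>0} \frac{\alpha\, x^k}{1 - \alpha\, x^k} \tag{37.1-32} \]

This is given in [209, p.468], also the following:

\[ \text{if } \sum_{k>0} \frac{a_k\, x^k}{1 - x^k} = f(x) \text{ and } \sum_{k>0} a_k\, x^k = g(x) \text{ then } f(x) = \sum_{k>0} g(x^k) \tag{37.1-33} \]

We note a relation that is useful for the computation of the sum, it is given in [214, p.644, ex.27]:

\[ \sum_{k>0} \frac{a_k\, x^k}{1 - x^k} = \sum_{k>0} \left[ a_k + \sum_{j>0} \left( a_k + a_{k+j} \right) x^{k j} \right] x^{k^2} \tag{37.1-34} \]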


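For the related series

\[ P(x) = \sum_{k>0} \frac{a_k\, x^k}{1 + x^k} = - \sum_{k>0} \sum_{j>0} (-1)^j\, a_k\, x^{k j} \tag{37.1-35} \]

we find (by computing the k-th term on both sides: $a_k x^k/(1 + x^k) = a_k x^k/(1 - x^k) - 2\, a_k x^{2k}/(1 - x^{2k})$)

\[ P(x) = L(x) - 2\, L(x^2) \tag{37.1-36} \]

The other direction is obtained by repeatedly using $L(x) = P(x) + 2\, L(x^2)$:

\[ L(x) = \sum_{k=0}^{\infty} 2^k\, P\left(x^{2^k}\right) \tag{37.1-37} \]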
Use relations |37.1-34] and |37.1-36] to find 
23 2 y (1 — 24 y+ )- M (ak + ak+j) (at - para (37.1-38) 
k>0 j>0 


37.2 Conversion of power series to infinite products 


37.2.1 Products of the form $\prod_{k>0} (1 - x^k)^{b_k}$

Given a series with constant term one,

\[ f(x) = 1 + \sum_{k>0} a_k\, x^k \tag{37.2-1} \]

we want to find an infinite product such that

\[ f(x) = \prod_{k>0} \left( 1 - x^k \right)^{b_k} \tag{37.2-2} \]

We take the logarithm, differentiate, and multiply by x:

\[ x\, \frac{f'(x)}{f(x)} = \sum_{k>0} \frac{(-k\, b_k)\, x^k}{1 - x^k} \tag{37.2-3} \]

The expression on the right side is a Lambert series with coefficients $-k\, b_k$, the expression on the left is
easily computable as a power series, and we know how to compute a Lambert series from a power series.
We have

\[ b_k = -\frac{1}{k} \sum_{d \backslash k} a_d\, \mu(k/d) \tag{37.2-4} \]

where the $a_k$ are the coefficients of the power series for $x f'(x)/f(x)$. The conversion to a product can
be implemented as
be implemented as 


ser2prod(t)=
{ /* Let t=[1,a1,a2,a3, ...], n=length(t), where t(x)=1+sum_{k=1}^{n}{a_k*x^k};
   * Return p=[p1,p2,p3,...] so that (up to order n)
   * t(x)=\prod_{j=1}^{n}{(1-x^j)^{p_j}}
   */
    local(v);
    v = Ser(t);
    v = v'/v;
    v = vector(#t-1, j, polcoeff(v, j-1));
    v = ser2lambert(v);
    v = vector(#v, j, -v[j]/j);
    return( v );
}


A simple example is f(x) = exp(x), so $x f'/f = x$, and

\[ \exp(x) = \prod_{k>0} \left( 1 - x^k \right)^{-\mu(k)/k} = \frac{(1 - x^2)^{1/2}\, (1 - x^3)^{1/3}\, (1 - x^5)^{1/5} \cdots}{(1 - x)\, (1 - x^6)^{1/6}\, (1 - x^{10})^{1/10} \cdots} \tag{37.2-5} \]

Taking the logarithm, we find

\[ x = -\sum_{k>0} \frac{\mu(k)}{k}\, \log\left( 1 - x^k \right) \tag{37.2-6} \]

Setting f(x) = 1 − 2x gives relation 18.3-6a on page 380 (number of binary Lyndon words):


? ser2prod(Vec(1-2*x+O(x^20)))
[2, 1, 2, 3, 6, 9, 18, 30, 56, 99, 186, 335, 630, 1161, 2182, 4080, 7710, 14532, 27594] 
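The conversion can also be done without symbolic power series. The following is a minimal sketch, assuming floating point coefficients and replacing the explicit Möbius sum by an equivalent sieve-like subtraction over multiples; the names `ser2prod` and `binary_lyndon_counts` in C++ are ours.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exponents b_1..b_n with f(x) = prod_k (1 - x^k)^{b_k} (up to order n),
// for f given as [1, f_1, ..., f_n].
std::vector<double> ser2prod(const std::vector<double> &f)
{
    assert(!f.empty() && f[0] == 1.0);
    const std::size_t n = f.size() - 1;
    // g = x f'/f from g * f = x f':  g_m = m f_m - sum_{i=1}^{m-1} f_i g_{m-i}
    std::vector<double> g(n + 1, 0.0);
    for (std::size_t m = 1; m <= n; ++m)
    {
        g[m] = m * f[m];
        for (std::size_t i = 1; i < m; ++i) g[m] -= f[i] * g[m - i];
    }
    // Lambert inversion by sifting: once g[k] is final it equals the Lambert
    // coefficient l_k; remove its contribution from all proper multiples of k.
    std::vector<double> b(n);
    for (std::size_t k = 1; k <= n; ++k)
    {
        const double l = g[k];
        for (std::size_t j = 2 * k; j <= n; j += k) g[j] -= l;
        b[k - 1] = -l / k;
    }
    return b;
}

// Exponents for f(x) = 1 - 2x, rounded to integers.
std::vector<long> binary_lyndon_counts(std::size_t n)
{
    std::vector<double> f(n + 1, 0.0);
    f[0] = 1.0;
    if (n >= 1) f[1] = -2.0;
    std::vector<long> r;
    for (double b : ser2prod(f)) r.push_back(std::lround(b));
    return r;
}
```

The first ten values returned by `binary_lyndon_counts(10)` agree with the GP output above.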


Setting $f(x) = 1 - x - x^2$ gives the number of binary Lyndon words without the subsequence 00 (entry
A006206 in ): 


? ser2prod(Vec(1-x-x^2+O(x^20)))
[1, 1, 1, 1, 2, 2, 4, 5, 8, 11, 18, 25, 40, 58, 90, 135, 210, 316, 492] 


The ordinary generating function for the $e_k$ corresponding to the product form $f(x) = \prod_{k>0} (1 - x^k)^{e_k}$ is

\[ \sum_{k>0} e_k\, x^k = -\sum_{k>0} \frac{\mu(k)}{k}\, \log\left( f(x^k) \right) \tag{37.2-7} \]

This can be seen by using the product form for f on the right side, using the power series $\log(1 - x) = -(x + x^2/2 + x^3/3 + \ldots)$ and using the defining property of the Möbius function (relation 37.1-7).

An example is relation |18.3-6b on page 380| For the cyclotomic polynomials we find (via 
on B 1-15 on page 706): 


O og ( Ya (2) = Y jes (37.2-8) 


k= 


Rp 
a 

= 
3 
3 


For example, by setting n = 2 we get

\[ x^2 - x = -\sum_{k>0} \frac{\mu(k)}{k}\, \log\left( 1 + x^k \right) \tag{37.2-9} \]
37.2.2 Products of the form $\prod_{k>0} (1 + x^k)^{c_k}$

For the transformation into products of the form $\prod_{k>0} (1 + x^k)^{c_k}$ we set

\[ f(x) = \prod_{k>0} \left( 1 + x^k \right)^{c_k} \tag{37.2-10} \]

and note that

\[ x\, \frac{f'(x)}{f(x)} = \sum_{k>0} \frac{(+k\, c_k)\, x^k}{1 + x^k} \tag{37.2-11} \]

So we need a transformation into series of this type. As the Möbius transform is not (easily) applicable
we use a greedy algorithm:


ser2lambertplus(t)=
{ /* Let t=[a1,a2,a3, ...], n=length(t), where t(x)=sum_{k=1}^{n}{a_k*x^k};
   * Return L=[l1,l2,l3,...] so that (up to order n)
   * t(x)=\sum_{j=1}^{n}{l_j*x^j/(1+x^j)}
   */
    local(n, L, tk);
    n = length(t);
    L = vector(n);
    for (k=1, n,
        tk = t[k];
        L[k] = tk;
        \\ subtract tk * x^k/(1+x^k):
        forstep(j=k, n, 2*k, t[j] -= tk);
        forstep(j=k+k, n, 2*k, t[j] += tk);
    );
    return( L );
}
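The greedy step works because x^k/(1+x^k) = x^k − x^(2k) + x^(3k) − ..., so that subtracting the k-th term only changes coefficients with larger index. The following is a minimal sketch of both directions, assuming integer coefficients and 0-based arrays; the name `lambertplus2ser` for the inverse is ours.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given t = [a_1, ..., a_n] return L = [l_1, ..., l_n] with
// sum_k a_k x^k = sum_j l_j x^j / (1 + x^j) up to order n.
std::vector<long> ser2lambertplus(std::vector<long> t)
{
    const std::size_t n = t.size();
    std::vector<long> L(n);
    for (std::size_t k = 1; k <= n; ++k)
    {
        const long tk = t[k - 1];
        L[k - 1] = tk;
        // subtract tk * x^k/(1+x^k) = tk * (x^k - x^(2k) + x^(3k) - ...)
        for (std::size_t j = k; j <= n; j += 2 * k) t[j - 1] -= tk;
        for (std::size_t j = 2 * k; j <= n; j += 2 * k) t[j - 1] += tk;
    }
    return L;
}

// Inverse direction: expand sum_j l_j x^j / (1 + x^j) into a power series.
std::vector<long> lambertplus2ser(const std::vector<long> &L)
{
    const std::size_t n = L.size();
    std::vector<long> t(n, 0);
    for (std::size_t k = 1; k <= n; ++k)
    {
        long sign = 1;
        for (std::size_t j = k; j <= n; j += k, sign = -sign) t[j - 1] += sign * L[k - 1];
    }
    return t;
}
```

For example, all coefficients equal to one correspond to the power series x + 2x^3 − x^4 + 2x^5 + ...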
Now we can compute the product form via 
ser2prodplus(t)=
{ /* Let t=[1,a1,a2,a3, ...], n=length(t), where t(x)=1+sum_{k=1}^{n}{a_k*x^k};
   * Return p=[p1,p2,p3,...] so that (up to order n)
   * t(x)=\prod_{j=1}^{n}{(1+x^j)^{p_j}}
   */
    local(v);
    v = Ser(t);
    v = v'/v;
    v = vector(#t-1, j, polcoeff(v, j-1));
    v = ser2lambertplus(v);
    v = vector(#v, j, v[j]/j);
    return( v );
}

A product $\prod_{k>0} (1 - x^k)^{b_k}$ can be converted into a product $\prod_{k>0} (1 + x^k)^{c_k}$ via the relation $(1 - x) = \prod_{k \ge 0} \left( 1 + x^{2^k} \right)^{-1}$.


37.2.3 Conversion to eta-products 


Define the eta function (or η-function) via

\[ \eta(x) := \prod_{j=1}^{\infty} \left( 1 - x^j \right) \tag{37.2-12} \]

The conversion of a power series to a product of the form (eta-product, or η-product)

\[ \prod_{k=1}^{\infty} \left[ \eta(x^k) \right]^{p_k} \tag{37.2-13} \]

can be done as follows:
ser2etaprod(v)=
{ /* Let v=[1,a1,a2,a3, ...], n=length(v), where t(x)=1+sum_{k=1}^{n}{a_k*x^k};
   * Return p=[p1,p2,p3,...] so that (up to order n)
   * t(x)=\prod_{j=1}^{n}{eta(x^j)^{p_j}}
   * where eta(x) = prod(k>0, (1-x^k))
   */
    local(n, t);
    v = ser2prod(v);
    n = length(v);
    for (k=1, n,
        t = v[k];
        forstep (j=k+k, n, k, v[j]-=t; );
    );
    return( v );
}


Similarly, to convert into a product of the form

\[ \prod_{k=1}^{\infty} \left[ \eta_+(x^k) \right]^{p_k} \quad \text{where} \quad \eta_+(x) := \prod_{j=1}^{\infty} \left( 1 + x^j \right) = \frac{\eta(x^2)}{\eta(x)} \tag{37.2-14} \]

use

ser2etaprodplus(v)=
{ /* Let v=[1,a1,a2,a3, ...], n=length(v), where t(x)=1+sum_{k=1}^{n}{a_k*x^k};
   * Return p=[p1,p2,p3,...] so that (up to order n)
   * t(x)=\prod_{j=1}^{n}{eta_+(x^j)^{p_j}}
   * where eta_+(x) = prod(k>0, (1+x^k))
   */
    local(n, t);
    v = ser2prodplus(v);
    n = length(v);
    for (k=1, n,
        t = v[k];
        forstep (j=k+k, n, k, v[j]-=t; );
    );
    return( v );
}


The routines are useful for computations with the generating functions of integer partitions of certain 
types, see section |16.4 on page 344 


We note two relations with Lambert series taken from [209] p.468]: 


cd gw EUR 1 Ber 
n(x) = exp (- 5 TI z) = exp | — y T 5 i] =|- 5 P 5 d | (37.2-15a) 
kel k=1 — d\k k=1 dk 
my(x) = exp Y E Um = exp Na (1 — a^) p» (37.2-15b) 
+ px jw 4e d k=1 d\k 2 l 


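\[ \eta(x) = 1 - \sum_{k=1}^{\infty} x^k \prod_{j=1}^{k-1} \left( 1 - x^j \right) = 1 - \sum_{k=1}^{\infty} x^k \prod_{j=k+1}^{\infty} \left( 1 - x^j \right) \tag{37.2-16a} \]
\[ \eta_+(x) = 1 + \sum_{k=1}^{\infty} x^k \prod_{j=1}^{k-1} \left( 1 + x^j \right) = 1 + \sum_{k=1}^{\infty} x^k \prod_{j=k+1}^{\infty} \left( 1 + x^j \right) \tag{37.2-16b} \]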
 2: ( E2^3 ) / ( E4 )
 3: ( E3^4 ) / ( E9 )
 4: ( E4^7 ) / ( E8^3 )
 5: ( E5^6 ) / ( E25 )
 6: ( E6^12 E36 ) / ( E12^4 E18^3 )
 7: ( E7^8 ) / ( E49 )
 8: ( E8^15 ) / ( E16^7 )
 9: ( E9^13 ) / ( E27^4 )
10: ( E10^18 E100 ) / ( E20^6 E50^3 )
11: ( E11^12 ) / ( E121 )
12: ( E12^28 E72^3 ) / ( E24^12 E36^7 )
13: ( E13^14 ) / ( E169 )
14: ( E14^24 E196 ) / ( E28^8 E98^3 )
15: ( E15^24 E225 ) / ( E45^6 E75^4 )
16: ( E16^31 ) / ( E32^15 )
17: ( E17^18 ) / ( E289 )
18: ( E18^39 E108^4 ) / ( E36^13 E54^12 )
19: ( E19^20 ) / ( E361 )
20: ( E20^42 E200^3 ) / ( E40^18 E100^7 )
21: ( E21^32 E441 ) / ( E63^8 E147^4 )
22: ( E22^36 E484 ) / ( E44^12 E242^3 )
23: ( E23^24 ) / ( E529 )
24: ( E24^60 E144^7 ) / ( E48^28 E72^15 )
25: ( E25^31 ) / ( E125^6 )
26: ( E26^42 E676 ) / ( E52^14 E338^3 )
27: ( E27^40 ) / ( E81^13 )
28: ( E28^56 E392^3 ) / ( E56^24 E196^7 )
29: ( E29^30 ) / ( E841 )
30: ( E30^72 E180^6 E300^4 E450^3 ) / ( E60^24 E90^18 E150^12 E900 )
31: ( E31^32 ) / ( E961 )
32: ( E32^63 ) / ( E64^31 )
33: ( E33^48 E1089 ) / ( E99^12 E363^4 )
34: ( E34^54 E1156 ) / ( E68^18 E578^3 )
35: ( E35^48 E1225 ) / ( E175^8 E245^6 )
36: ( E36^91 E216^12 ) / ( E72^39 E108^28 )
37: ( E37^38 ) / ( E1369 )
38: ( E38^60 E1444 ) / ( E76^20 E722^3 )
39: ( E39^56 E1521 ) / ( E117^14 E507^4 )
40: ( E40^90 E400^7 ) / ( E80^42 E200^15 )
41: ( E41^42 ) / ( E1681 )
42: ( E42^96 E252^8 E588^4 E882^3 ) / ( E84^32 E126^24 E294^12 E1764 )

Figure 37.2-A: Expressions for $\prod_{j=0}^{n-1} \eta(\omega^j q)$ as products of η-functions, Ek denotes $\eta(q^k)$.

 2: ( E2^3 ) / ( E1 E4 )
 3: ( E3^4 ) / ( E1 E9 )
 4: ( E4^8 ) / ( E2^3 E8^3 )
 5: ( E5^6 ) / ( E1 E25 )
 6: ( E1 E4 E6^12 E9 E36 ) / ( E2^3 E3^4 E12^4 E18^3 )
 7: ( E7^8 ) / ( E1 E49 )
 8: ( E8^18 ) / ( E4^7 E16^7 )
 9: ( E9^14 ) / ( E3^4 E27^4 )
10: ( E1 E4 E10^18 E25 E100 ) / ( E2^3 E5^6 E20^6 E50^3 )
11: ( E11^12 ) / ( E1 E121 )
12: ( E2^3 E8^3 E12^32 E18^3 E72^3 ) / ( E4^8 E6^12 E24^12 E36^8 )
13: ( E13^14 ) / ( E1 E169 )
14: ( E1 E4 E14^24 E49 E196 ) / ( E2^3 E7^8 E28^8 E98^3 )
15: ( E1 E9 E15^24 E25 E225 ) / ( E3^4 E5^6 E45^6 E75^4 )
16: ( E16^38 ) / ( E8^15 E32^15 )
17: ( E17^18 ) / ( E1 E289 )
18: ( E3^4 E12^4 E18^42 E27^4 E108^4 ) / ( E6^12 E9^14 E36^14 E54^12 )
19: ( E19^20 ) / ( E1 E361 )
20: ( E2^3 E8^3 E20^48 E50^3 E200^3 ) / ( E4^8 E10^18 E40^18 E100^8 )
21: ( E1 E9 E21^32 E49 E441 ) / ( E3^4 E7^8 E63^8 E147^4 )
22: ( E1 E4 E22^36 E121 E484 ) / ( E2^3 E11^12 E44^12 E242^3 )
23: ( E23^24 ) / ( E1 E529 )
24: ( E4^7 E16^7 E24^72 E36^7 E144^7 ) / ( E8^18 E12^28 E48^28 E72^18 )

Figure 37.2-B: Expressions for $\prod_{\gcd(j,n)=1} \eta(\omega^j q)$ as products of η-functions, Ek denotes $\eta(q^k)$.

37.2.4 Some identities for the eta function ‡


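The given routines let us discover identities like the following: let $\omega = \exp(2\pi i/n)$, $\sigma(n)$ the sum of the
divisors of n, and $\mu$ the Möbius function, then

\[
  P_n(q) := \prod_{j=0}^{n-1} \eta(\omega^j q) = \prod_{d \mid n} \eta(q^{n d})^{\mu(d)\,\sigma(n/d)}
  \tag{37.2-17}
\]

See figure 37.2-A for such identities with small n. Special cases are

\begin{align}
P_n(q) &= \frac{\eta(q^n)^{2n-1}}{\eta(q^{2n})^{n-1}} \quad\text{for } n \text{ a power of } 2 \tag{37.2-18a} \\
P_p(q) &= \frac{\eta(q^p)^{p+1}}{\eta(q^{p^2})} \quad\text{for } p \text{ prime} \tag{37.2-18b}
\end{align}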
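The exponents in figure 37.2-A can be tabulated directly. The following is a minimal Python sketch (not part of the GP routines of this chapter; the function names are illustrative) that assumes the exponent of $\eta(q^{nd})$ is $\mu(d)\,\sigma(n/d)$ for each divisor d of n, which agrees with the entries of the figure:

```python
def divisors(n):
    """All positive divisors of n (naive, fine for small n)."""
    return [d for d in range(1, n + 1) if n % d == 0]

def sigma(n):
    """Sum of the divisors of n."""
    return sum(divisors(n))

def mobius(n):
    """Moebius function by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # square factor
            result = -result
        p += 1
    if n > 1:
        result = -result          # one remaining prime factor
    return result

def eta_exponents(n):
    """Map m -> e, meaning the product of eta(q^m)^e over m = n*d, d | n."""
    exps = {}
    for d in divisors(n):
        e = mobius(d) * sigma(n // d)
        if e != 0:
            exps[n * d] = e
    return exps

print(eta_exponents(6))
```

For n = 6 this gives the exponents of E6, E12, E18, and E36 listed in the figure.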
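We further have

\[
  T_n(q) := \prod_{j=1,\ \gcd(j,n)=1}^{n} \eta(\omega^j q) = \prod_{d \mid n} P_d(q)^{\mu(n/d)}
  \tag{37.2-19}
\]

and by Möbius inversion, $P_n(q) = \prod_{d \mid n} T_d(q)$. From the first few such relations (see figure 37.2-B) we
get, writing $E_k$ for $\eta(q^k)$ where convenient:

\begin{align}
\eta(-q) &= \frac{E_2^3}{E_1 E_4} \tag{37.2-20a} \\
\eta_+(-q) &= \frac{E_1 E_4}{E_2^2} = \frac{\eta(q^2)}{\eta(-q)} \tag{37.2-20b} \\
\eta(-q)\,\eta_+(-q) &= E_2 = \eta(q)\,\eta_+(q) \tag{37.2-20c} \\
\eta(+iq)\,\eta(-iq) &= \frac{E_4^8}{E_2^3 E_8^3} \tag{37.2-20d}
\end{align}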


Product expansions of the real and imaginary parts of $\eta(iq)$ and $\eta^{-1}(iq)$ are


A (i 0 _ II (1 _ T did (1 _ qe] (1 _ or) (37.2212) 


i qi6n—2) (1 — q16n-14) 


. oo (1 —q en) (1 m penc (1 _ g 9 
9g II (1 — qi6n—6) (1 — qi6n—10) 
oo n—2)\2 n—6y3 n— 3 n— 2 
E TER ii (1 — gl$ 2) (1 - "I : (1 — q'6 10) (1 — q!6 14) 
(1 — glón) (1 — g16n— 4y4 (d qi6n—8)? ae g16n-12)* 
T il (1 — gi6n— 2)* (1 (i. gión 6)? (1 = gi9n-10)? (1 B gHn-14)* 
(1— qi6") (1— q16n-4)* (es qi6n—8)? 18 g16n-12)* 


(37.2-21b) 


Many identities from section 31.3.3 on page 607 can be rediscovered (without the need of elliptic functions)
by determining the η-products for the absolute value and the real and imaginary parts of certain
expressions. The relation for the eta function then follows from $|a|^2 = \mathrm{Re}^2 a + \mathrm{Im}^2 a$. For example, for
$a := \eta^2(iq)$ we find

\[
  \eta^2(iq) = \frac{E_4 E_8^4}{E_2 E_{16}^2} - 2\,i\,q\,\frac{E_4^3 E_{16}^2}{E_2 E_8^2},
  \qquad
  |\eta^2(iq)|^2 = \frac{E_4^{16}}{E_2^6 E_8^6}
  \tag{37.2-22}
\]

Writing $|a|^2 = \mathrm{Re}^2 a + \mathrm{Im}^2 a$ in terms of the η-products gives

\[
  \frac{E_4^{16}}{E_2^6 E_8^6} = \left[\frac{E_4 E_8^4}{E_2 E_{16}^2}\right]^2 + \left[2\,q\,\frac{E_4^3 E_{16}^2}{E_2 E_8^2}\right]^2
  \tag{37.2-23}
\]

We substitute $q^2$ by $q$ and give the resulting relation in two forms:

\[
  E_2^{14} E_8^{4} = E_1^{4} E_4^{14} + 4\,q\,E_1^{4} E_2^{4} E_4^{2} E_8^{8}
  \tag{37.2-24a}
\]
B Jf Es |°, n 1 i 
EPE,E2]] © (mm R] |^ | |B? Ey Ee 
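For $a := \eta^4(iq)$ we find

\[
  a = \frac{E_2^2 E_4^4}{E_8^2} - 4\,i\,q\,\frac{E_4^4 E_8^2}{E_2^2},
  \qquad
  |a|^2 = \frac{E_4^{32}}{E_2^{12} E_8^{12}}
  \tag{37.2-25}
\]

This gives identity 31.3-32a on page 609:

\begin{align}
E_2^{24} &= E_1^{16} E_4^{8} + 16\,q\,E_1^{8} E_4^{16} \tag{37.2-26a} \\
Z^{24} &= X^{16}\,Y^{8} + 16\,q\,X^{8}\,Y^{16} \quad\text{where} \tag{37.2-26b} \\
Z &= E_2, \quad X = E_1, \quad Y = E_4 \tag{37.2-26c}
\end{align}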
For a := n(iq)/n(iq?) we find 
E, Eg Es E23, . Eo E, EP Ed, la? E} Es ES 
E VA =: “>=. => 
É ESE | Bs Ela C Ej Ej ES, 
'This gives 
E$ EÊ = E, E3 E} Ej+qE| E; E, Ej, 
| Ez Eg | u EA ES 
E, Ez E, E15 E, Ei» Ez Ea 
Z = XY°+qYX°* where 
Z = £2 Ee, X = E] Exp, Y = Ez Éa 
Using a := n? (i q)/n((iq)?) = n? (iq)/n(—i q?) we find 
a= o ee ee 
Ei E5 Eg Ej» E» Eg Ej 
'This gives 
Ej E3 Ej, = E? Ey Eg + 9q E} Ez Es Ej Ez Ej) 


2 
Ez Ei]? _ | E2 Es Eo Er 6 -9q E3 Ez Es Ej» 
Ey Es El E? E? El 
For a := (i q)/n(iq?) we find 
= Ej Es Els E54 E? iq Ea Ej Es Ez | ? = Ej Els E? 
E» Es Eg Es E24 E36 Ey E3 Es 


This gives 


E$ E2 E2, EÍ = EEE Eg Ef, Bi, Bag +q E? El Es Es Eso 
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(37.2-23) 


(37.2-24a) 


(37.2-24b) 


(37.2-25) 


(37.2-26a) 
(37.2-26b) 
(37.2-26c) 


(37.2-27)


(37.2-28a) 
(37.2-28b) 


(37.2-28c) 
(37.2-28d) 


(37.2-29) 


(37.2-30a) 


(37.2-30b) 


(37.2-31) 


(37.2-32) 
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For a := n(iq)/n(i q?) we obtain 
FLEE, E E Eh 


= = 4 710740 37:2-33 
ERES 75 RES, Ej ES ES, Es) 
'This gives 
Ex Et) = qE} Ey Es Ei, + EEE: Ezo (37.2-34a) 
E; Bap EPER, V? E2ES, 1? 
| = [um te ug (37.2-34b) 
Ej Es E, E? El, E, E2 El, 
Z^ = XY%+qYX? where (37.2-34c) 
Z = Ej;Ep, X = EEx, Y = Es Es (37.2-34d) 
By replacing q by q/i in relation 37.2-22 and using identity 37.2-20a we get

\[
  E_2 E_8^6 = E_1^2 E_4^2 E_8 E_{16}^2 + 2\,q\,E_2 E_4^2 E_{16}^4
  \tag{37.2-35}
\]


The same can be done for all a considered so far. For example, the second of the following identities is 
obtained from the first: 


E® E19 ES El E? 
—2 (5 2:78 . 28 16 
n (iq) = +2iq— (37.2-36a) 
EP Eig EP 
E¿EsEl, = ElE?+2qE El Ek (37.2-36b) 


No identities are obtained for a = n(¿q)/n( q?) for (apparently) any odd prime p > 7. We use relation 


31.3-20b] on page [606] for p=: 


E} E}, = F? Ey E? Fog + 2q E; El Er Eg (37.2-37a) 
Z? = X?Y -29Y?X where (37.2-37b) 
Z = E» Esa, X = Ej Ez, Y = E, Eg (37.2-37c) 


37.3 Continued fractions 


A continued fraction is an expression of the form:

\[
  K(a,b) = a_0 + \cfrac{b_1}{a_1 + \cfrac{b_2}{a_2 + \cfrac{b_3}{a_3 + \cfrac{b_4}{a_4 + \cdots}}}}
  \tag{37.3-1}
\]

Continued fractions are sometimes expressed in the following form:

\[
  K(a,b) = a_0 + \frac{b_1}{a_1 +}\;\frac{b_2}{a_2 +}\;\frac{b_3}{a_3 +}\;\frac{b_4}{a_4 +}\;\cdots
  \tag{37.3-2}
\]

The $b_k$ and $a_k$ are respectively called the k-th partial numerators and denominators.


For $k \ge 0$ let $P_k/Q_k$ be the value of the above fraction if $b_{k+1}$ is set to zero (that is, the continued fraction
terminates at index k). The ratio is called the k-th convergent of the continued fraction:

\[
  \frac{P_k}{Q_k} = a_0 + \frac{b_1}{a_1 +}\;\frac{b_2}{a_2 +}\;\frac{b_3}{a_3 +}\;\frac{b_4}{a_4 +}\;\cdots\;\frac{b_{k-1}}{a_{k-1} +}\;\frac{b_k}{a_k}
  \tag{37.3-3}
\]


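Simultaneous multiplication of $a_i$, $b_i$, $b_{i+1}$ by some nonzero value does not change the value of the
continued fraction: we have

\[
  a_0 + \frac{b_1}{a_1 +}\;\frac{b_2}{a_2 +}\;\frac{b_3}{a_3 +}\;\frac{b_4}{a_4 +}\;\cdots
  \;=\;
  a_0 + \frac{c_1 b_1}{c_1 a_1 +}\;\frac{c_2 c_1 b_2}{c_2 a_2 +}\;\frac{c_3 c_2 b_3}{c_3 a_3 +}\;\frac{c_4 c_3 b_4}{c_4 a_4 +}\;\cdots
  \tag{37.3-4}
\]

where all $c_i$ are arbitrary nonzero constants.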
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37.3.1 Simple continued fractions 


Continued fractions where all $b_k$ are equal to 1 (and all the $a_k$ are positive) are called simple continued
fractions. Rational numbers have terminating continued fractions. Note that the expression of a rational
number as simple continued fraction is not unique:

\begin{align}
[a_0, \ldots, a_{n-1}, a_n] &= [a_0, \ldots, a_{n-1}, a_n - 1, 1] \quad\text{if } a_n > 1 \tag{37.3-5a} \\
[a_0, \ldots, a_{n-1}, 1] &= [a_0, \ldots, a_{n-1} + 1] \quad\text{if } a_n = 1 \tag{37.3-5b}
\end{align}

Solutions of the quadratic equation $\alpha x^2 + \beta x + \gamma = 0$ that are not rational ($\Delta := \beta^2 - 4\alpha\gamma$ not a square)
have simple continued fractions that are eventually periodic. For example:

? contfrac(sqrt(5))
[2, 4, 4, 4, 4, 4, ...]
? contfrac(2+sqrt(3))
[3, 1,2, 1,2, 1,2, ...]
? contfrac(sqrt(19))
[4, 2,1,3,1,2,8, 2,1,3,1,2,8, 2,1,3,1,2,8, ...]
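The periodicity can be observed without floating point arithmetic. The following Python sketch uses the standard integer recurrence for the simple continued fraction of a square root (a well-known technique, not taken from this text; the name `sqrt_cf` is illustrative). It keeps the complete quotient in the form $(\sqrt{n}+m)/d$ with integers m and d:

```python
from math import isqrt

def sqrt_cf(n, terms):
    """First `terms` partial quotients of the simple continued fraction of sqrt(n)."""
    a0 = isqrt(n)
    if a0 * a0 == n:
        return [a0]               # perfect square: the expansion terminates
    cf = [a0]
    m, d, a = 0, 1, a0            # complete quotient is (sqrt(n) + m) / d
    while len(cf) < terms:
        m = d * a - m
        d = (n - m * m) // d      # the division is always exact
        a = (a0 + m) // d
        cf.append(a)
    return cf

print(sqrt_cf(19, 13))
```

Because only finitely many pairs (m, d) can occur, the sequence of partial quotients must eventually repeat.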


Write $P_k/Q_k$ for the k-th convergent (in lowest terms) of the simple continued fraction expansion of the
number x. Then the convergent is the best approximation in the following sense: if p/q is any better
rational approximation to x (that is, $|p/q - x| < |P_k/Q_k - x|$), then one must have $q > Q_k$.
For the simple continued fraction of x we have

\[
  \left| x - \frac{P_n}{Q_n} \right| \;\le\; \frac{1}{Q_n\,Q_{n+1}} \;<\; \frac{1}{Q_n^2}
  \tag{37.3-6}
\]

and equality can only occur with terminating continued fractions.

Use relation 37.3-4 to convert a continued fraction into a simple continued fraction:


cf2simple(A,B)=
{
    local(c); c=1;
    for (j=2, #A-1,
        c = 1/(B[j]);
        B[j] *= c;  \\ B[j]==1
        B[j+1] *= c;
        A[j] *= c;
    );
    \\ note last term of B[] != 1 in general
    return([A,B]);
}


The terms $a_j$ where $j > 0$ can be set to 1 by the next routine:

cf2simpleB(A,B)=
{
    local(c); c=1;
    for (j=2, #B-1,  \\ leave a0 as it is
        c = 1/(A[j]);
        A[j] *= c;  \\ A[j]==1
        B[j] *= c;
        B[j+1] *= c;
    );
    \\ note first and last term of A[] != 1 in general
    return([A,B]);
}
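That the scaling of relation 37.3-4 leaves the value unchanged can be checked with exact rational arithmetic. The sketch below is a Python analogue of the idea behind cf2simple(), not a transcription of it: the names `cf_value` and `make_b_one` are illustrative, index 0 holds $a_0$, `b[0]` is unused, and (unlike the GP routine) the last partial numerator is normalized as well.

```python
from fractions import Fraction

def cf_value(a, b):
    """Value of a[0] + b[1]/(a[1] + b[2]/(a[2] + ...)), evaluated backward."""
    x = Fraction(a[-1])
    for k in range(len(a) - 2, -1, -1):
        x = a[k] + Fraction(b[k + 1]) / x
    return x

def make_b_one(a, b):
    """Scale a[j], b[j], b[j+1] by c = 1/b[j] so that all b[j], j >= 1, become 1."""
    a = [Fraction(v) for v in a]
    b = [Fraction(v) for v in b]
    for j in range(1, len(a)):
        c = 1 / b[j]
        b[j] *= c                 # now equal to 1
        a[j] *= c
        if j + 1 < len(b):
            b[j + 1] *= c
    return a, b

a = [1, 2, 2, 2, 2]               # first terms of Brouncker's fraction for 4/Pi
b = [1, 1, 9, 25, 49]
a1, b1 = make_b_one(a, b)
print(cf_value(a, b), cf_value(a1, b1))
```

Both evaluations give the same rational number, while the partial denominators of the normalized fraction are no longer integers.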


37.3.1.1 Computing the simple continued fraction of a real number 


Compute the simple continued fraction of a real number x as follows:

procedure number_to_scf(x, n, a[0..n-1])
{
    for k:=0 to n-1
    {
        xi := floor(x)
        a[k] := xi
        /* if (x-xi)==0 then terminate */
        x := 1 / (x-xi)
    }
}

Here n is the number of requested terms $a_k$. Some check has to be inserted to avoid possible division by
zero (indicating a terminating continued fraction, as will occur for rational x).
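A minimal Python sketch of the procedure, with the termination check included, might look as follows. It is an illustration only; passing a `fractions.Fraction` makes the computation exact, so the expansion of a rational number stops by itself.

```python
from fractions import Fraction
from math import floor

def number_to_scf(x, n):
    """At most n terms of the simple continued fraction of x."""
    a = []
    for _ in range(n):
        xi = floor(x)
        a.append(xi)
        if x == xi:               # remainder is zero: the expansion terminates
            break
        x = 1 / (x - xi)
    return a

print(number_to_scf(Fraction(415, 93), 10))
```

With floating point input the later terms are spoiled by rounding errors, which is why the number of meaningful terms depends on the working precision.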


37.3.1.2 Continued fractions of polynomial roots 


= RootO0f(z^3 - 2) == 1.2599210... 


contfrac(r) == [1, 3, 1, 5, 1, 1, 4, 1, 1, 8, 1, 14, 1, 10, 2, ...] 
f-z^3-2 a Cu Edu ET ==> 1 

f-z^3 - 3*z72 - 3*z - 1 3.84732210186307 ==> 3 

f=10*z73 - 6*z^2- 6*z - 1 r=1.18018873554841 ==> 1 

f-3*z^3 - 12*z^2 - 24*z - 10 r=5.54973648578239 ==> 5 

f-bb*z^3 - 81*z72 - 33x*z - 3 r=1.81905335713127 ==> 1 

f-262*z^3 + 30*z^2 - 84*z - 55 r=1. righ ra Paneer ==> 1 

f-AT*z^3 - 162*z^2 - 216*z - 62 r-4.52649103705930 ==> 4 
f-b10*z^3 - 744*z72 - 402*z - 47 r-1.89936756679748 ==> 1 
f-683*z^3 + 360*z^ a - T86*z - 510 r=1.11189244188653 ==> 1 
f-253*z^3 - 1983*z^2 - 2409*z - 683 r-8.93715413784671 ==> 8 
f-17331*z^3 - 14439*z^2 - 4089*z - 253 r=1.06706032616757 ==> 1 
f-1450*z^3 - 19026x*z^2 - 37554*z - 17331 r=14.9119465584038 ==> 14 


Figure 37.3-A: Computation of the continued fraction of the positive real root of the polynomial z? — 2. 


= RootOf(z*2 - 29) == 5.38516480... 
contfrac(r) ==[5, 2, 1, 1, 2, 10, 25; 1,45, 25-10 12] 


f-z^2 - 29 r-5.38516480713450 == 

f- Aa" 2 - 10*z - 1 r-2.59629120178363 ==> 2 

f-b*z^2 - 6*z - 4 ABO AEG ==> 1 

f-b*z^2 - 4*z - b zb 47703296142690 ==> 1 

f-4*z^2 - 6*z - b r= 2.09629120178363 ==> 2 

f-z^2 - 10*z - 4 r=10.3851648071345 ==> 10 

f-4*z^72 - 10*z - 1 r=2.59629120178363 ==> 2 <--= restart period 
f-b*z^2 - 6*z - 4 r=1.67703296142690 ==> 1 


O 00 N O» OMAN NA 


Figure 37.3-B: Computation of the continued fraction of the positive real root of the polynomial z? — 29. 


Let r > 1 be the only real positive root of a polynomial F(x) with integer coefficients and positive leading
coefficient. Then the (simple) continued fraction $[a_0, a_1, \ldots, a_n]$ of r can be computed as follows (taken
from p.261]):

1. Set $k = 0$, $F_0(x) = F(x)$, and $d = \deg(F)$.
2. Find the (unique) real positive root $r_k$ of $F_k(x)$, set $a_k = \lfloor r_k \rfloor$. If $k = n$, then stop.
3. Set $G(x) = F_k(x + a_k)$, set $F_{k+1} = -G^*(x) = -x^d\,G(1/x)$.
4. Set $k = k + 1$ and goto step 2.

A simple demonstration is

f = z^3 - 2
ff(y)=subst(f, z, y)  \\ for solve() function
{ for (k=1, 12,
    print1(" f=", f);
    r = solve(x=0.9, 1e9, ff(x));  \\ lazy implementation
    print1("  r=", r);
    ak = floor( r );
    print1("  ==> ", ak);
    g = subst(f, z, z+ak);  \\ shifted polynomial
    f = -polrecip( g );  \\ negated reciprocal of g
    print();
); }

The output with $F(x) = z^3 - 2$ is shown in figure 37.3-A. With quadratic equations one obtains periodic
continued fractions, figure 37.3-B shows the computation for $F(x) = x^2 - 29$. For a comparison of
methods for the computation of continued fractions for algebraic numbers see [79].
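The four steps can also be carried out with integer arithmetic only. The following Python sketch is an illustration under the stated assumptions (a unique positive real root r > 1 and a positive leading coefficient); the function names are not from the text. The floor of the root is found by a sign search on integer arguments instead of a numerical solver, and coefficients are stored highest degree first.

```python
def eval_poly(c, x):
    """Evaluate the polynomial with coefficients c (highest degree first) at x."""
    v = 0
    for ck in c:
        v = v * x + ck
    return v

def shift_poly(c, s):
    """Coefficients of f(z + s), computed by repeated synthetic division."""
    c = list(c)
    n = len(c)
    for i in range(1, n):
        for j in range(1, n - i + 1):
            c[j] += s * c[j - 1]
    return c

def floor_of_root(c):
    """Floor of the unique positive root r >= 1, using f(x) <= 0 exactly for 1 <= x <= r."""
    lo, hi = 1, 2
    while eval_poly(c, hi) <= 0:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if eval_poly(c, mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

def root_to_scf(c, n):
    """At most n terms of the simple continued fraction of the positive root."""
    terms = []
    for _ in range(n):
        a = floor_of_root(c)
        terms.append(a)
        g = shift_poly(c, a)            # G(x) = F_k(x + a_k)
        if g[-1] == 0:                  # the root was the integer a: expansion ends
            break
        c = [-v for v in reversed(g)]   # negated reciprocal polynomial
    return terms

print(root_to_scf([1, 0, 0, -2], 12))
print(root_to_scf([1, 0, -29], 8))
```

The intermediate coefficient lists are the polynomials printed in figures 37.3-A and 37.3-B.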


37.3.2 Computation of the convergents (evaluation) 


The computation of the sequence of convergents uses the recurrence

\begin{align}
P_k &= a_k\,P_{k-1} + b_k\,P_{k-2} \tag{37.3-7a} \\
Q_k &= a_k\,Q_{k-1} + b_k\,Q_{k-2} \tag{37.3-7b}
\end{align}

Set $P_{-1}/Q_{-1} := 1/0$ and $P_0/Q_0 := a_0/1$ to initialize. The following procedure computes the sequences
of values $P_k$ and $Q_k$ for $k = -1 \ldots n$ for a given continued fraction:

procedure ratios_from_cf(a[0..n], b[0..n], n, P[-1..n], Q[-1..n])
{
    P[-1] := 1
    Q[-1] := 0
    P[0] := a[0]
    Q[0] := 1
    for k:=1 to n
    {
        P[k] := a[k] * P[k-1] + b[k] * P[k-2]
        Q[k] := a[k] * Q[k-1] + b[k] * Q[k-2]
    }
}
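The recurrence translates almost literally into Python. The sketch below is illustrative (the name `convergents` is not from the text); it returns the pairs $(P_k, Q_k)$ for $k = 0 \ldots n$ and keeps $P_{-1}/Q_{-1} = 1/0$ only as the initial state. `b[0]` is unused.

```python
def convergents(a, b):
    """Pairs (P_k, Q_k), k = 0..n, of the continued fraction K(a, b)."""
    p_prev, q_prev = 1, 0         # P_{-1}, Q_{-1}
    p, q = a[0], 1                # P_0, Q_0
    pairs = [(p, q)]
    for k in range(1, len(a)):
        p, p_prev = a[k] * p + b[k] * p_prev, p
        q, q_prev = a[k] * q + b[k] * q_prev, q
        pairs.append((p, q))
    return pairs

# first terms of Brouncker's continued fraction for 4/Pi
a = [1, 2, 2, 2, 2]
b = [1, 1, 9, 25, 49]
print(convergents(a, b))
```

The pairs are not reduced to lowest terms: the last one, 945/789, equals 315/263.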


A function to compute the numerical value x from the first n terms of a simple continued fraction is

function ratio_from_cf(a[0..n-1], n)
{
    x := a[n-1]
    for k:=n-2 to 0 step -1
    {
        x := 1/x + a[k]
    }
    return x
}

With rational arithmetic and a general (non-simple) continued fraction, the algorithm becomes:

function ratio_from_cf(a[0..n-1], b[0..n-1], n)
{
    P := a[n-1]
    Q := b[n-1]
    for k:=n-2 to 0 step -1
    {
        { P, Q } := { a[k] * P + b[k] * Q, P }  // x := b[k] / x + a[k]
    }
    return P/Q
}


37.3.2.1 Implementation 


Converting a number to a simple continued fraction can be done with GP's built-in function contfrac().
The final convergent can be computed with contfracpnqn():

? default(realprecision,23)
   realprecision = 28 significant digits (23 digits displayed)
? Pi
3.1415926535897932384626
? cf=contfrac(Pi)
[3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3]
? ?contfracpnqn
contfracpnqn(x): [p_n,p_{n-1}; q_n,q_{n-1}] corresponding to the continued fraction x.
? m=contfracpnqn(cf)
[428224593349304 139755218526789]
[136308121570117 44485467702853]
? 1.0*m[1,1]/m[2,1]
3.1415926535897932384626

The number of terms of the continued fraction depends on the precision used: with greater precision more
terms can be computed. The computation of the m-th convergent of a continued fraction given as two
vectors a[] and b[] can be implemented as (backward variant):

cfab2r(a,b, m=-2)=
{
    local(n, r);
    n = length(a);
    if ( m>-2, m = min(n, m), m = n );  \\ default: m=n
    if ( m==n, m=n-1 );
    if ( m<0, return( 0 ) );  \\ infinity
    r = 0;
    m += 1;
    forstep (k=m, 2, -1, r = b[k]/(a[k]+r); );
    r += a[1];  \\ b[1] unused
    return( r );
}


We can also use the recursion relations 37.3-7a and 37.3-7b. We do not store all pairs $P_n$, $Q_n$ but only
return the final pair $P_m$, $Q_m$:

cfab2pq(a,b,m=-2)=
{
    local(n, p, p1, p2, q, q1, q2, i);
    n = length(a)-1;
    if ( m>-2, m = min(n, m), m = n );  \\ default: m=n
    if ( m<0, return( [1, 0] ) );  \\ infinity
    p1 = 1;
    q1 = 0;
    p = a[1];
    q = 1;  \\ b[1] unused
    for (k=1, m,
        i = k+1;
        p2 = p1;  p1 = p;
        q2 = q1;  q1 = q;
        p = a[i]*p1 + b[i]*p2;
        q = a[i]*q1 + b[i]*q2;
    );
    return( [p,q] );
}


We use our routines to compute the convergents of the continued fraction for 4/π given 1658 by Brouncker:

\[
  \frac{4}{\pi} = 1 + \cfrac{1^2}{2 + \cfrac{3^2}{2 + \cfrac{5^2}{2 + \cfrac{7^2}{2 + \cdots}}}}
  \tag{37.3-8}
\]

Figure 37.3-C shows how to set up the vectors containing the $a_k$ and $b_k$ and check the convergents.


37.3.2.2 Fast evaluation as matrix product 


For the evaluation of a continued fraction with a large number of terms rewrite relations 37.3-7a and
37.3-7b as a matrix product (with $b_0 = 1$):

\[
  \begin{bmatrix} P_k & Q_k \\ P_{k-1} & Q_{k-1} \end{bmatrix}
  = \begin{bmatrix} a_k & b_k \\ 1 & 0 \end{bmatrix}
    \begin{bmatrix} P_{k-1} & Q_{k-1} \\ P_{k-2} & Q_{k-2} \end{bmatrix}
  = \prod_{j=0}^{k} \begin{bmatrix} a_j & b_j \\ 1 & 0 \end{bmatrix}
  = \begin{bmatrix} a_k & b_k \\ 1 & 0 \end{bmatrix} \prod_{j=0}^{k-1} \begin{bmatrix} a_j & b_j \\ 1 & 0 \end{bmatrix}
  \tag{37.3-9}
\]

The last equality shows that the next term in the product has to be multiplied at the left. An example,
compare to figure 37.3-C:

? a=[1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2];
? b=[1, 1, 9, 25, 49, 81, 121, 169, 225, 289, 361, 441, 529, 625, 729];
? m=matid(2);for(n=1,5,m=[a[n],b[n];1,0]*m);m
[945 789]
[105 76]
? P=m[1,1];  \\ == 945
? Q=m[1,2];  \\ == 789
? P/Q
315/263
? 4/Pi-P/Q
0.075520913556

default(realprecision, 55);  \\ use enough precision
default(format, "g.11");  \\ print with moderate precision
default(echo, 0);

\r contfrac.gpi  \\ functions cfab2pq and cfab2r

/* set up the continued fraction: */
a=vector(n, j, 2); a[1]=1;
b=vector(n, j, (2*j-3)^2);

/* print convergents and their error: */
{ for(k=0, n-1,
    t=cfab2pq(a,b, k);
    p=t[1]; q=t[2];
    print1(k, ": ", p, " / ", q);
    print1("\n d=", x-p/q);
    print();
); }

15  /* =n */
[1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]  /* =a */
[1, 1, 9, 25, 49, 81, 121, 169, 225, 289, 361, 441, 529, 625, 729]  /* =b */
1.2732395447  /* = 4/Pi */

0: 1 / 1
 d=0.27323954474
1: 3 / 2
 d=-0.22676045526
2: 15 / 13
 d=0.11939339088
3: 105 / 76  /* =p3/q3 */
 d=-0.10833940263  /* =4/Pi-p3/q3 */
4: 945 / 789
 d=0.075520913556
5: 10395 / 7734
 d=-0.070825622061
[--snip--]
13: 213458046676875 / 163842638377950
 d=-0.029583998575
14: 6190283353629375 / 4964894559637425
 d=0.026428906710

Figure 37.3-C: A GP script demonstrating the function cfab2pq() which computes the convergents of
a continued fraction (top) and its output (bottom, comments added). Here convergence is rather slow.

Use the binary splitting algorithm (section 34.1 on page 651) for the efficient computation of the matrix
product.
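A recursive splitting of the matrix product can be sketched in Python as follows. This is only an illustration of the idea (the names `mat_mul` and `cf_matrix` are not from the text, and `b[0]` is assumed to be 1); the point is that the product of the upper half of the index range is multiplied at the left of the product of the lower half.

```python
def mat_mul(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]],
        [x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]],
    ]

def cf_matrix(a, b, lo, hi):
    """Product M_{hi-1} ... M_{lo+1} M_{lo} where M_j = [[a[j], b[j]], [1, 0]]."""
    if hi - lo == 1:
        return [[a[lo], b[lo]], [1, 0]]
    mid = (lo + hi) // 2
    # later factors go to the left
    return mat_mul(cf_matrix(a, b, mid, hi), cf_matrix(a, b, lo, mid))

a = [1, 2, 2, 2, 2]
b = [1, 1, 9, 25, 49]
print(cf_matrix(a, b, 0, 5))      # first row holds P_4 and Q_4
```

With big integers the two halves have operands of similar size, which is what makes fast multiplication pay off.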


37.3.3 Determinantal expressions 


The numerators and denominators of successive convergents can be expressed as a determinant:

\[
  \det \begin{bmatrix} P_k & Q_k \\ P_{k-1} & Q_{k-1} \end{bmatrix}
  = P_k\,Q_{k-1} - P_{k-1}\,Q_k = (-1)^{k-1} \prod_{j=1}^{k} b_j
  \tag{37.3-10}
\]

The relation is obtained by taking determinants on both sides of equation 37.3-9. The relation can also
be written as

\[
  \frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}} = \frac{(-1)^{k-1} \prod_{j=1}^{k} b_j}{Q_{k-1}\,Q_k}
  \tag{37.3-11}
\]

For simple continued fractions we have $b_j = 1$ so the product in the numerator equals 1. Further,
by inserting $P_{k-1} = (P_k - b_k P_{k-2})/a_k$ (relation 37.3-7a) and the equivalent expression for $Q_{k-1}$ into
relation 37.3-10, we get

\[
  \det \begin{bmatrix} P_k & Q_k \\ P_{k-2} & Q_{k-2} \end{bmatrix}
  = P_k\,Q_{k-2} - P_{k-2}\,Q_k = (-1)^{k}\,a_k \prod_{j=1}^{k-1} b_j
  \tag{37.3-12}
\]

Equivalently,

\[
  \frac{P_k}{Q_k} - \frac{P_{k-2}}{Q_{k-2}} = \frac{(-1)^{k}\,a_k \prod_{j=1}^{k-1} b_j}{Q_{k-2}\,Q_k}
  \tag{37.3-13}
\]

This relation tells us (provided all $a_j$ and $b_j$ are positive) that the sequence of even convergents is
increasing and the sequence of odd convergents is decreasing. As both converge to a common limit
we have $P_o/Q_o > P_e/Q_e$ for all even e and odd o. Equality can occur only for terminating continued
fractions.
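Both determinant relations are easy to confirm numerically. The following Python sketch (illustrative; `check_determinants` is not a routine of the text) recomputes the convergents with the recurrence and tests the identities $P_k Q_{k-1} - P_{k-1} Q_k = (-1)^{k-1} \prod_{j=1}^{k} b_j$ and $P_k Q_{k-2} - P_{k-2} Q_k = (-1)^k a_k \prod_{j=1}^{k-1} b_j$ for every available k:

```python
def convergents(a, b):
    """Pairs (P_k, Q_k), k = 0..n; b[0] is unused."""
    p_prev, q_prev = 1, 0
    p, q = a[0], 1
    pairs = [(p, q)]
    for k in range(1, len(a)):
        p, p_prev = a[k] * p + b[k] * p_prev, p
        q, q_prev = a[k] * q + b[k] * q_prev, q
        pairs.append((p, q))
    return pairs

def check_determinants(a, b):
    """True if both determinant relations hold for all k."""
    pq = convergents(a, b)
    prod_b = 1                                # b_1 * ... * b_{k-1}
    for k in range(1, len(a)):
        pk, qk = pq[k]
        pk1, qk1 = pq[k - 1]
        if pk * qk1 - pk1 * qk != (-1) ** (k - 1) * prod_b * b[k]:
            return False
        if k >= 2:
            pk2, qk2 = pq[k - 2]
            if pk * qk2 - pk2 * qk != (-1) ** k * a[k] * prod_b:
                return False
        prod_b *= b[k]
    return True

a = [1] + [2] * 9
b = [1] + [(2 * j - 1) ** 2 for j in range(1, 10)]
print(check_determinants(a, b))
```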


37.3.4 Subsequences of convergents 


Sometimes the terms $a_k$, $b_k$ of the continued fraction are given in the form "$a_k = u(k)$ if k even, $a_k =
v(k)$ else" (and $b_k$ equivalently). Then one may want to compute the $x = K(a,b)$ in a stride-2 manner
to regularize the involved expressions:

\begin{align}
P_k &= A_k\,P_{k-2} + B_k\,P_{k-4} \tag{37.3-14a} \\
Q_k &= A_k\,Q_{k-2} + B_k\,Q_{k-4} \tag{37.3-14b}
\end{align}

We write the recurrence relation three times

\begin{align}
P_k &= a_k\,P_{k-1} + b_k\,P_{k-2} \tag{37.3-15a} \\
P_{k-1} &= a_{k-1}\,P_{k-2} + b_{k-1}\,P_{k-3} \tag{37.3-15b} \\
P_{k-2} &= a_{k-2}\,P_{k-3} + b_{k-2}\,P_{k-4} \tag{37.3-15c}
\end{align}

and eliminate the terms $P_{k-1}$ and $P_{k-3}$. This gives

\begin{align}
A_k &= \frac{a_k\,b_{k-1} + b_k\,a_{k-2} + a_k\,a_{k-1}\,a_{k-2}}{a_{k-2}} \tag{37.3-16a} \\
B_k &= -\frac{a_k\,b_{k-1}\,b_{k-2}}{a_{k-2}} \tag{37.3-16b}
\end{align}

The stride-3 version

\begin{align}
P_k &= A_k\,P_{k-3} + B_k\,P_{k-6} \tag{37.3-17a} \\
Q_k &= A_k\,Q_{k-3} + B_k\,Q_{k-6} \tag{37.3-17b}
\end{align}

leads to the expressions (writing $a_n$ for $a_{k-n}$ and $b_n$ for $b_{k-n}$ to reduce line width):

\begin{align}
A_k &= \frac{a_0 b_1 b_3 + b_0 a_2 b_3 + b_0 b_2 a_4 + a_0 a_1 a_2 b_3 + a_0 a_1 b_2 a_4 + a_0 b_1 a_3 a_4 + b_0 a_2 a_3 a_4 + a_0 a_1 a_2 a_3 a_4}{b_3 + a_3 a_4} \tag{37.3-18a} \\
B_k &= \frac{b_0 b_2 b_3 b_4 + a_0 a_1 b_2 b_3 b_4}{b_3 + a_3 a_4} \tag{37.3-18b}
\end{align}

When setting $a_k := \alpha$, $b_k := \beta$ the expressions for $A_k$ and $B_k$ simplify to the coefficients in relations
35.1-16c on page 672 (stride-2) and 35.1-16d (stride-3) for recurrences.
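The stride-2 coefficients can be checked against the ordinary recurrence with exact arithmetic. In the Python sketch below (an illustration; the function names are not from the text) the coefficients are taken in the form $A_k = a_k a_{k-1} + b_k + a_k b_{k-1}/a_{k-2}$ and $B_k = -a_k b_{k-1} b_{k-2}/a_{k-2}$, which is what the elimination of $P_{k-1}$ and $P_{k-3}$ yields:

```python
from fractions import Fraction

def convergents(a, b):
    """Pairs (P_k, Q_k), k = 0..n; b[0] is unused."""
    p_prev, q_prev = 1, 0
    p, q = a[0], 1
    pairs = [(p, q)]
    for k in range(1, len(a)):
        p, p_prev = a[k] * p + b[k] * p_prev, p
        q, q_prev = a[k] * q + b[k] * q_prev, q
        pairs.append((p, q))
    return pairs

def stride2_coefficients(a, b, k):
    """A_k and B_k of the stride-2 recurrence, as exact fractions."""
    big_a = a[k] * a[k - 1] + b[k] + Fraction(a[k] * b[k - 1], a[k - 2])
    big_b = -Fraction(a[k] * b[k - 1] * b[k - 2], a[k - 2])
    return big_a, big_b

def stride2_holds(a, b):
    """Check P_k = A_k P_{k-2} + B_k P_{k-4} (and the same for Q_k) for k >= 4."""
    pq = convergents(a, b)
    for k in range(4, len(a)):
        big_a, big_b = stride2_coefficients(a, b, k)
        for i in (0, 1):          # i = 0: numerators, i = 1: denominators
            if pq[k][i] != big_a * pq[k - 2][i] + big_b * pq[k - 4][i]:
                return False
    return True

a = [1] + [2] * 9
b = [1] + [(2 * j - 1) ** 2 for j in range(1, 10)]
print(stride2_coefficients(a, b, 4), stride2_holds(a, b))
```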


37.3.5 Relation to alternating series 


With relation 37.3-11 it is possible to rewrite a continued fraction $x = K(a,b)$ with positive $a_k$, $b_k$ as an
alternating series

\[
  x = a_0 + \frac{b_1}{Q_0 Q_1} - \frac{b_1 b_2}{Q_1 Q_2} + \frac{b_1 b_2 b_3}{Q_2 Q_3} \mp \ldots
      + (-1)^{k}\,\frac{\prod_{j=1}^{k+1} b_j}{Q_k\,Q_{k+1}} \pm \ldots
  \tag{37.3-19}
\]

Thus the algorithm for the accelerated summation of alternating series can be applied
to compute x.
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37.3.6 Continued fractions for infinite products 


A continued fraction for the product

\[
  P = \prod_{k=0}^{\infty} (1 + Y_k)
  \tag{37.3-20a}
\]

in terms of $a = [a_0, a_1, \ldots]$ and $b = [b_0, b_1, \ldots]$ is

\begin{align}
a &= [\,1,\ 1,\ Y_0 + (1+Y_0)\,Y_1,\ Y_1 + (1+Y_1)\,Y_2,\ Y_2 + (1+Y_2)\,Y_3,\ \ldots\,] \tag{37.3-20b} \\
b &= [\,1,\ +Y_0,\ -1 \cdot Y_1\,(1+Y_0),\ -Y_0\,Y_2\,(1+Y_1),\ -Y_1\,Y_3\,(1+Y_2),\ \ldots\,] \tag{37.3-20c}
\end{align}

Y(k) = eval(Str("Y" k))  \\ return symbol Yk
yprod(n)= if (n<=0, 1, prod(j=0, n-1, (1+Y(j))))

n=3
pr = yprod(n)
((Y2 + 1)*Y1 + (Y2 + 1))*Y0 + ((Y2 + 1)*Y1 + (Y2 + 1))
yv = vector(n, j, Y(j-1))
[Y0, Y1, Y2]
t = cfprod(yv);
a=t[1]
[1, 1, (Y1 + 1)*Y0 + Y1, (Y2 + 1)*Y1 + Y2]
b=t[2]
[1, Y0, -Y1*Y0 - Y1, (-Y2*Y1 - Y2)*Y0]
{ for(k=0, n,
    t=cfab2pq(a,b, k);
    p=t[1]; q=t[2];
    print1(k, ": (", p, ") / (", q, ")");
    yp = yprod(k);
    print1("\n  == ", simplify(p/q));
    print();
); }
0: (1) / (1)
  == 1
1: (Y0 + 1) / (1)  \\ (p1) / (q1)
  == Y0 + 1  \\ == yprod(1)
2: ((Y1 + 1)*Y0^2 + (Y1 + 1)*Y0) / (Y0)  \\ (p2) / (q2)
  == (Y1 + 1)*Y0 + (Y1 + 1)  \\ == yprod(2) == (1+Y0)*(1+Y1)
3: (((Y2 + 1)*Y1^2 + (Y2 + 1)*Y1)*Y0^2 + ((Y2 + 1)*Y1^2 + (Y2 + 1)*Y1)*Y0) / (Y1*Y0)
  == ((Y2 + 1)*Y1 + (Y2 + 1))*Y0 + ((Y2 + 1)*Y1 + (Y2 + 1))

Figure 37.3-D: Verification of relations 37.3-20b and 37.3-20c using GP.


For a given vector $y = [Y_0, Y_1, \ldots]$ the computation of individual values $a_k$ and $b_k$ can be implemented
as:

cfproda(yv, n)=
{
    local( y2, y3, d );
    if ( n<=1, return(1) );
    y3 = yv[n];
    y2 = yv[n-1];
    return( (y2+y3*(1+y2)) );
}

cfprodb(yv, n)=
{
    local( y1, y2, y3 );
    if (0==n, return(1) );  \\ unused
    if (1==n, return(+yv[1+0]) );
    y3 = yv[n];
    y2 = yv[n-1];
    y1 = if ( n==2, 1, yv[n-2] );
    return( -y1*y3*(1+y2) );
}

The routine cfprod() generates the vectors a and b with n + 1 terms where n is the length of y:

cfprod(yv)=
{
    local(n, a, b);
    n = length(yv);
    n += 1;  \\ n+1 terms in continued fraction
    a = vector(n);
    b = vector(n);
    for (k=0, n-1,
        a[k+1] = cfproda(yv, k);
        b[k+1] = cfprodb(yv, k);
    );
    return( [a, b] );
}

Relations 37.3-20b and 37.3-20c can be verified using GP as shown in figure 37.3-D.
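The same verification can be done numerically with rational values for the $Y_k$. The Python sketch below mirrors the logic of cfproda() and cfprodb() with 0-based indices; it is an illustration only and assumes that all $Y_k$ are nonzero (the denominators $Q_k$ are products of the $Y_k$).

```python
from fractions import Fraction

def cfprod(y):
    """Vectors a, b (n+1 terms each) for the product of the (1 + y[k]), k = 0..n-1."""
    a = [Fraction(1), Fraction(1)]
    b = [Fraction(1), y[0]]
    for k in range(2, len(y) + 1):
        y3, y2 = y[k - 1], y[k - 2]
        y1 = y[k - 3] if k > 2 else 1
        a.append(y2 + y3 * (1 + y2))
        b.append(-y1 * y3 * (1 + y2))
    return a, b

def convergents(a, b):
    """Pairs (P_k, Q_k), k = 0..n; b[0] is unused."""
    p_prev, q_prev = 1, 0
    p, q = a[0], 1
    pairs = [(p, q)]
    for k in range(1, len(a)):
        p, p_prev = a[k] * p + b[k] * p_prev, p
        q, q_prev = a[k] * q + b[k] * q_prev, q
        pairs.append((p, q))
    return pairs

y = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)]
a, b = cfprod(y)
print([p / q for p, q in convergents(a, b)])
```

The k-th convergent equals the partial product $(1+Y_0)\cdots(1+Y_{k-1})$; here the last value is 3/2 · 4/3 · 6/5 = 12/5.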


37.3.7 An expression for a sum of products 


Define $Z_n$ as

\begin{align}
Z_n &:= z_1 + z_1 z_2 + z_1 z_2 z_3 + z_1 z_2 z_3 z_4 + \ldots = \sum_{k=1}^{n} \prod_{j=1}^{k} z_j \tag{37.3-21a} \\
 &= z_1\,[1 + z_2\,[1 + z_3\,[1 + z_4\,[\ldots]]]] \tag{37.3-21b}
\end{align}

Then $Z_\infty$ has the continued fraction

\[
  Z_\infty = \cfrac{z_1}{1 - \cfrac{z_2}{1 + z_2 - \cfrac{z_3}{1 + z_3 - \cfrac{z_4}{1 + z_4 - \cfrac{z_5}{1 + z_5 - \cdots}}}}}
  \tag{37.3-22}
\]

That is, $Z_\infty = K(a,b)$ where

\begin{align}
a &= [\,0,\ 1,\ z_2 + 1,\ z_3 + 1,\ z_4 + 1,\ z_5 + 1,\ z_6 + 1,\ \ldots\,] \tag{37.3-23a} \\
b &= [\,1,\ z_1,\ -z_2,\ -z_3,\ -z_4,\ -z_5,\ -z_6,\ \ldots\,] \tag{37.3-23b}
\end{align}

For the n-th convergent $P_n/Q_n$ one has $Q_n = 1$ and $P_n = Z_n$. The corresponding simple continued
fraction is

\[
  a = \Bigl[\,0,\ \frac{1}{z_1},\ \frac{-(z_2+1)\,z_1}{z_2},\ \frac{(z_3+1)\,z_2}{z_3 z_1},\ \frac{-(z_4+1)\,z_3 z_1}{z_4 z_2},\ \frac{(z_5+1)\,z_4 z_2}{z_5 z_3 z_1},\
  \frac{-(z_6+1)\,z_5 z_3 z_1}{z_6 z_4 z_2},\ \frac{(z_7+1)\,z_6 z_4 z_2}{z_7 z_5 z_3 z_1},\ \frac{-(z_8+1)\,z_7 z_5 z_3 z_1}{z_8 z_6 z_4 z_2},\ \ldots\,\Bigr]
  \tag{37.3-24}
\]

With $a_0 = 0$ and $a_n = 1$ for $n > 0$ we have

\[
  b = \Bigl[\,1,\ z_1,\ \frac{-z_2}{z_2+1},\ \frac{-z_3}{(z_3+1)(z_2+1)},\ \frac{-z_4}{(z_4+1)(z_3+1)},\ \frac{-z_5}{(z_5+1)(z_4+1)},\ \ldots\,\Bigr]
  \tag{37.3-25}
\]

To convert a hypergeometric series (see chapter 36 on page 685)

\[
  F\Bigl( {a_1, a_2, \ldots \atop b_1, b_2, \ldots} \,\Big|\, z \Bigr)
  \tag{37.3-26}
\]

into a continued fraction, set $z_1 = 1$ and for $k \ge 1$ set

\[
  z_{k+1} = \frac{z}{k}\;\frac{\prod_j (a_j + k - 1)}{\prod_j (b_j + k - 1)}
  \tag{37.3-27}
\]

An implementation is

hyper2cf(va, vb, n, z='z)=
\\ convert hypergeom(va,vb,z) into a continued fraction
{
    local(cfa, cfb, m);
    n += 2;
    cfa = vector(n);
    cfb = vector(n);
    cfa[1] = 0;  cfa[2] = 1;
    cfb[1] = 1;  cfb[2] = 1;
    for (k=3, n,
        m = 1/(k-2);  \\ hidden lower parameter 1
        m *= prod(j=1, #va, va[j]+(k-3));  \\ upper parameters
        m /= prod(j=1, #vb, vb[j]+(k-3));  \\ lower parameters
        m *= z;  \\ argument
        cfa[k] = (m+1);
        cfb[k] = -m;
    );
    return( [cfa, cfb] );
}

We convert $-\log(1-z)/z = F\bigl({1,\,1 \atop 2} \,\big|\, z\bigr)$ to a continued fraction and check the result:


N=7;
va=[1,1]; vb=[2];
t=hyper2cf(va,vb,N);
cfa=t[1]
[0, 1, 1/2*z + 1, 2/3*z + 1, 3/4*z + 1, 4/5*z + 1, 5/6*z + 1, 6/7*z + 1, 7/8*z + 1]
cfb=t[2]
[1, 1, -1/2*z, -2/3*z, -3/4*z, -4/5*z, -5/6*z, -6/7*z, -7/8*z]
t=cfab2pq(cfa,cfb)
[1/8*z^7 + 1/7*z^6 + 1/6*z^5 + 1/5*z^4 + 1/4*z^3 + 1/3*z^2 + 1/2*z + 1, 1]
s1=t[1]/t[2]+O(z^N)
1 + 1/2*z + 1/3*z^2 + 1/4*z^3 + 1/5*z^4 + 1/6*z^5 + 1/7*z^6 + O(z^7)
s2=hypergeom(va,vb,z,N)+O(z^N)
1 + 1/2*z + 1/3*z^2 + 1/4*z^3 + 1/5*z^4 + 1/6*z^5 + 1/7*z^6 + O(z^7)
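A numeric counterpart of this check, with the argument fixed at z = 1/2, can be sketched in Python. The function names are illustrative; `hyper_ratios` assumes the term ratios $z_{k+1} = (z/k) \prod_j (a_j + k - 1) / \prod_j (b_j + k - 1)$ with $z_1 = 1$, and `sum_of_products_cf` builds the vectors a and b of the sum-of-products continued fraction ($a = [0, 1, z_2+1, \ldots]$, $b = [1, z_1, -z_2, \ldots]$).

```python
from fractions import Fraction

def hyper_ratios(upper, lower, z, n):
    """[z_1, ..., z_{n+1}]: z_1 = 1, then the ratios of consecutive series terms."""
    zs = [Fraction(1)]
    for k in range(1, n + 1):
        t = Fraction(z) / k
        for u in upper:
            t *= u + k - 1
        for w in lower:
            t /= w + k - 1
        zs.append(t)
    return zs

def sum_of_products_cf(z):
    """Vectors a, b with K(a, b) = z1 + z1*z2 + z1*z2*z3 + ..."""
    a = [Fraction(0), Fraction(1)] + [zk + 1 for zk in z[1:]]
    b = [Fraction(1), Fraction(z[0])] + [-zk for zk in z[1:]]
    return a, b

def convergents(a, b):
    """Pairs (P_k, Q_k), k = 0..n; b[0] is unused."""
    p_prev, q_prev = 1, 0
    p, q = a[0], 1
    pairs = [(p, q)]
    for k in range(1, len(a)):
        p, p_prev = a[k] * p + b[k] * p_prev, p
        q, q_prev = a[k] * q + b[k] * q_prev, q
        pairs.append((p, q))
    return pairs

zs = hyper_ratios([1, 1], [2], Fraction(1, 2), 7)
a, b = sum_of_products_cf(zs)
p, q = convergents(a, b)[-1]
print(p, q)
```

The denominator of the last convergent is 1 and the numerator is the partial sum of $z^k/(k+1)$ for $k = 0 \ldots 7$ at z = 1/2.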


For further information on continued fractions see and . An in-depth treatment is : 


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Chapter 38 


Synthetic Iterations ‡


It is easy to construct arbitrarily many iterations that converge super-linearly. Guided by some special
constants that in base 2 can be obtained by recursive constructions we build iterations that allow the 
computation of the constant in a base independent manner. The iterations lead to functions that typically 
cannot be identified in terms of known (named) functions. Some of the functions can be expressed as 
infinite sums or products. 


38.1 A variation of the iteration for the inverse 


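We start with the product form for the simplest iteration, the one for 1/(1 − y):

\begin{align}
I(y) &:= \frac{1}{1-y} \tag{38.1-1a} \\
 &= 1 + y + y^2 + y^3 + y^4 + \ldots \tag{38.1-1b} \\
 &= (1+y)\,(1+y^2)\,(1+y^4)\,(1+y^8) \cdots (1+y^{2^k}) \cdots \tag{38.1-1c} \\
 &= (1+Y_0)\,(1+Y_1)\,(1+Y_2)\,(1+Y_3) \cdots (1+Y_k) \cdots \tag{38.1-1d} \\
 &\qquad\text{where } Y_0 = y,\ Y_{k+1} = Y_k^2 \notag
\end{align}

We now modify the signs in the infinite product:

\begin{align}
J(y) &:= (1-y)\,(1-y^2)\,(1-y^4)\,(1-y^8) \cdots (1-y^{2^k}) \cdots \tag{38.1-2a} \\
 &= 1 - y - y^2 + y^3 - y^4 + y^5 + y^6 - y^7 - y^8 + y^9 + \ldots \tag{38.1-2b} \\
 &= (1-Y_0)\,(1-Y_1)\,(1-Y_2)\,(1-Y_3) \cdots (1-Y_k) \cdots \tag{38.1-2c} \\
 &\qquad\text{where } Y_0 = y,\ Y_{k+1} = Y_k^2 \notag
\end{align}

The value of the n-th coefficient equals +1 if the parity of n is zero, else −1 (sequence A106400 in [312],
the Thue-Morse sequence). The function J can be implemented as

fj(y,N=5)=
{
    local(r);
    r = 1;
    for (k=1, N,
        r -= r*y;
        y *= y;
    );
    return(r);
}

Replacing the minus by a plus gives the implementation for the function I.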
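The sign pattern of the coefficients can be made visible by expanding the product exactly. The Python sketch below is illustrative (`j_coefficients` and the Python `fj` are not routines of the text): multiplying a polynomial of degree $2^k - 1$ by $(1 - y^{2^k})$ appends the negated coefficient list, which is the usual doubling construction of the Thue-Morse sequence. The numeric `fj` follows the GP routine.

```python
def j_coefficients(m):
    """Coefficients of (1-y)(1-y^2)(1-y^4)...(1-y^(2^(m-1))), lowest degree first."""
    c = [1]
    for _ in range(m):
        c = c + [-v for v in c]   # multiply by (1 - y^len(c))
    return c

def fj(y, n=5):
    """Product of the first n factors of J(y), as in the GP routine."""
    r = 1
    for _ in range(n):
        r -= r * y
        y *= y
    return r

print(j_coefficients(3))
print(fj(0.5, 8))
```

Coefficient n is +1 or −1 according to whether the number of ones in the binary expansion of n is even or odd.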


A related constant is the parity number (or Prouhet-Thue-Morse constant):

P = 0.4124540336401075977833613682584552830894783744557695575...   (38.1-3)
[base 2] = 0.0110, 1001, 1001, 0110, 1001, 0110, 0110, 1001, 1001, 0110, 0110, 1001, ...
[base 16] = 0.6996, 9669, 9669, 6996, 9669, 6996, 6996, 9669, 9669, 6996, 6996, 9669, ...
[CF] = [0, 2, 2, 2, 1, 4, 3, 5, 2, 1, 4, 2, 1, 5, 44, 1, 4, 1, 2, 4, 1, 1, 1, 5, 14, 1, 50, 15, 5, 1, 1, 1, 4, ...]
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1001 

100110010110 

1001100101101001011001101001 

P --» 0110100110010110100101100110100110010110011010010110100110010110 ... 


Figure 38.1-A: Computation of the Thue-Morse sequence by string substitution. 


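The sequence of zeros and ones in the binary expansion is entry A010060 in [312]. The constant P can
be computed defining

\[
  K(y) := \frac{I(y) - J(y)}{2} = y + y^2 + y^4 + y^7 + y^8 + y^{11} + y^{13} + y^{14} + y^{16} + \ldots
  \tag{38.1-4}
\]

We have

\[
  P = \frac{1}{2}\,K\Bigl(\frac{1}{2}\Bigr) = \frac{1}{4}\Bigl[ I\Bigl(\frac{1}{2}\Bigr) - J\Bigl(\frac{1}{2}\Bigr) \Bigr] = \frac{1}{2} - \frac{1}{4}\,J\Bigl(\frac{1}{2}\Bigr)
  \tag{38.1-5}
\]

and item 125]

\[
  2 - 4P = J\Bigl(\frac{1}{2}\Bigr) = \prod_{k=0}^{\infty} \Bigl( 1 - \frac{1}{2^{2^k}} \Bigr) = 0.350183865439569608866554526966178\ldots
  \tag{38.1-6}
\]

The sequence of bits of the parity number can also be computed by the string substitution shown in
figure 38.1-A (which was created with the program [FXT: ds/stringsubst-demo.cc]).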


The following relations are direct consequences of the definitions of the functions I and J:

\begin{align}
\frac{I(y) + I(-y)}{2} &= I(y^2) = I(y)\,I(-y) \tag{38.1-7a} \\
I(y) &= \frac{J(y^2)}{J(y)} \tag{38.1-7b} \\
I(-y) &= \frac{1-y}{1+y}\,I(y) \tag{38.1-7c} \\
J(-y) &= \frac{1+y}{1-y}\,J(y) \tag{38.1-7d}
\end{align}

We have (from relation 16.4-1a on page 344):

\begin{align}
I(y) &= 1 + \sum_{k=0}^{\infty} y^{2^k} \prod_{j=0}^{k-1} \bigl(1 + y^{2^j}\bigr) \tag{38.1-8a} \\
J(y) &= 1 - \sum_{k=0}^{\infty} y^{2^k} \prod_{j=0}^{k-1} \bigl(1 - y^{2^j}\bigr) \tag{38.1-8b}
\end{align}

A functional equation for K is

\[
  K(y) = (1-y)\,K(y^2) + \frac{y}{1-y^2}
  \tag{38.1-9}
\]

It is solved by

\[
  K(y) = \sum_{k=0}^{\infty} \frac{y^{2^k}}{1 - y^{2^{k+1}}} \prod_{j=0}^{k-1} \bigl(1 - y^{2^j}\bigr)
  \tag{38.1-10}
\]
For the inverse of J we have

\begin{align}
\frac{1}{J(y)} &= 1 + y + 2y^2 + 2y^3 + 4y^4 + 4y^5 + 6y^6 + 6y^7 + 10y^8 + 10y^9 + 14y^{10} + \ldots \tag{38.1-11a} \\
 &= \bigl[(1-y)\,(1-y^2)\,(1-y^4)\,(1-y^8) \cdots \bigr]^{-1} = \prod_{k=0}^{\infty} I\bigl(y^{2^k}\bigr) \tag{38.1-11b} \\
 &= (1-y) \prod_{k=0}^{\infty} \frac{1 + y^{2^k}}{1 - y^{2^k}} \tag{38.1-11c} \\
 &= (1+y)\,(1+y^2)^2\,(1+y^4)^3\,(1+y^8)^4\,(1+y^{16})^5 \cdots (1+y^{2^k})^{k+1} \cdots \tag{38.1-11d}
\end{align}

Relation 38.1-11d can be used for a divisionless algorithm for the computation of 1/J:

binpart(y,N=5)=
{
    local(r);
    r = 1;
    for (k=1, N,
        for (j=1, k, r += r*y; );
        y *= y;
    );
    return(r);
}

The sequence of coefficients of the even powers of y in relation 38.1-11a is

1, 2, 4, 6, 10, 14, 20, 26, 36, 46, 60, 74, 94, 114, 140, 166, 202, ...

This is entry A000123 in [312], the number of binary partitions of the even numbers. The sequence
$\frac{1}{2}\,[2, 4, 6, 10, 14, \ldots]$ modulo 2 equals the period-doubling sequence, see section 38.5 on page 734. The
generating function $1 + 2y + 4y^2 + 6y^3 + 10y^4 + 14y^5 + 20y^6 + \ldots$ equals

\[
  \frac{I(y)}{J(y)} = (1+y)^2\,(1+y^2)^3\,(1+y^4)^4\,(1+y^8)^5\,(1+y^{16})^6 \cdots (1+y^{2^k})^{k+2} \cdots
  \tag{38.1-12}
\]

It can be computed via (note the change in the inner loop)

binpart2(y,N=5)=
{
    local(r);
    r = 1;
    for (k=1, N,
        for (j=1, k+1, r += r*y; );  \\ 1 ... k+1
        y *= y;
    );
    return(r);
}
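The coefficients of 1/J(y) count the partitions of n into powers of 2, so they can be cross-checked by a small counting routine. The Python sketch below uses the standard dynamic programming scheme for partitions into a given set of parts (a different technique from the product evaluation above; the name `binary_partitions` is illustrative):

```python
def binary_partitions(n_max):
    """c[n] = number of partitions of n into powers of 2, for n = 0..n_max."""
    c = [1] + [0] * n_max
    part = 1
    while part <= n_max:
        for n in range(part, n_max + 1):
            c[n] += c[n - part]   # allow one more part of size `part`
        part *= 2
    return c

c = binary_partitions(32)
print(c[:11])
print(c[::2])
```

The first list reproduces the coefficients of the series for 1/J(y), the second the coefficients of the even powers.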


For the function J we have 
S y 
d e a M e (38.1-13a) 
a k= 


Integration gives 


1 1+ y? = k 
—log(1— y) = 5 EH log E =P = y log (1 + y? ) (38.1-14a) 
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For the derivative of J we have 


\[
  J'(y) = -J(y) \sum_{k=0}^{\infty} \frac{2^k\,y^{2^k - 1}}{1 - y^{2^k}}
  \tag{38.1-15}
\]
The following functional equations hold for I(y):

\begin{align}
0 &= B - 2AB + A^2 \quad\text{where } A = I(y),\ B = I(y^2) \tag{38.1-16a} \\
0 &= B - 2AB - A^2 + 2A^2 B \quad\text{where } A = I(-y),\ B = I(-y^2) \tag{38.1-16b} \\
0 &= B - 3AB + 3A^2 B - A^3 \quad\text{where } A = I(y),\ B = I(y^3) \tag{38.1-16c} \\
0 &= B - 5AB + 10A^2 B - 10A^3 B + 5A^4 B - A^5 \quad\text{where } A = I(y),\ B = I(y^5) \tag{38.1-16d} \\
0 &= B\,\bigl[(1-A)^k - (-A)^k\bigr] + (-A)^k \quad\text{where } A = I(y),\ B = I(y^k) \tag{38.1-16e}
\end{align}

The following relation for J(y) can be derived from the functional equation for K(y) (relation 38.1-9),
the definition of K(y), and relation 38.1-7b:

\[
  0 = J_2^3 - 2\,J_4\,J_2\,J_1 + J_4\,J_1^2 \quad\text{where } J_1 = J(y),\ J_2 = J(y^2),\ J_4 = J(y^4)
  \tag{38.1-17}
\]

This relation is given with entry A106400 in [312], together with

\[
  0 = J_6\,J_1^3 - 3\,J_6\,J_2\,J_1^2 + 3\,J_6\,J_2^2\,J_1 - J_3\,J_2^3
  \tag{38.1-18}
\]

where $J_k = J(y^k)$. Relations between $J_1$, $J_2$, $J_k$, and $J_{2k}$ can be derived from relation 38.1-16e by
replacing I(y) by $J(y^2)/J(y)$. For example, k = 5 gives

\[
  0 = J_{10}\,J_1^5 - 5\,J_{10}\,J_2\,J_1^4 + 10\,J_{10}\,J_2^2\,J_1^3 - 10\,J_{10}\,J_2^3\,J_1^2 + 5\,J_{10}\,J_2^4\,J_1 - J_5\,J_2^5
  \tag{38.1-19}
\]

Relations 37.2-20a ... 37.2-20d on page 714 hold when replacing $\eta(x)$ with $J(x)$ and $\eta_+(x)$ with $I(x)$, as
well as relation 37.2-18a.


38.1.1 The Komornik-Loreti constant 
We have K(1/β) = 1 for

1/β = 0.5595245581967265251322097651574322858310764789686603076...   (38.1-20a)
[base 2] = 0.1000111100111101000000000110000000001101011000100010110...
β = 1.787231650182965933013274890337008385337931402961810997...   (38.1-20b)
[base 2] = 1.1100100110001000000000110110111111101001011101011010000...
[CF] = [1, 1, 3, 1, 2, 3, 188, 1, 12, 1, 1, 22, 33, 1, 10, 1, 1, 7, 1, 9, 1, 1, 20, 2, 15, 1, ...]

The constant β is the smallest real number in the interval (1,2) so that 1 has a unique expansion of the
form $\sum_{n=1}^{\infty} \delta_n\,\beta^{-n}$ where $\delta_n \in \{0,1\}$. It is called the Komornik-Loreti constant (see [8]). We have $\delta_n = 1$
where the Thue-Morse sequence equals 1. This was used for the computation of β: one solves K(y) = 1
for y. The transcendence of β is proved (using that J(y) is transcendental for algebraic y) in [9].
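Solving K(y) = 1 can be sketched in a few lines of Python with double precision. The sketch is illustrative only: it evaluates $K(y) = (I(y) - J(y))/2$ with a truncated product for J and uses plain bisection on the interval (0.5, 0.6), an interval chosen here because K is increasing on (0, 1) with K(0.5) < 1 < K(0.6).

```python
def k_function(y, factors=10):
    """K(y) = (I(y) - J(y)) / 2 with J truncated after `factors` factors."""
    i_val = 1.0 / (1.0 - y)
    j_val = 1.0
    for _ in range(factors):
        j_val *= 1.0 - y
        y *= y
    return (i_val - j_val) / 2.0

def komornik_loreti(iterations=60):
    """Approximate beta by bisection on K(y) = 1, returning 1/y."""
    lo, hi = 0.5, 0.6
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if k_function(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 1.0 / lo

print(komornik_loreti())
```

For more digits the same scheme can be run with a multiprecision type and a faster root finder.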


38.1.2 Third order variants 
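Variations of the third order iteration for 1/(1 − y)

\begin{align}
I(y) &:= \frac{1}{1-y} = 1 + y + y^2 + y^3 + y^4 + \ldots \tag{38.1-21a} \\
 &= (1+y+y^2)\,(1+y^3+y^6)\,(1+y^9+y^{18}) \cdots (1+y^{3^k}+y^{2\cdot 3^k}) \cdots \tag{38.1-21b} \\
 &= (1+Y_0+Y_0^2)\,(1+Y_1+Y_1^2)\,(1+Y_2+Y_2^2) \cdots (1+Y_k+Y_k^2) \cdots \tag{38.1-21c} \\
 &\qquad\text{where } Y_0 = y,\ Y_{k+1} = Y_k^3 \notag
\end{align}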
Start: 0
Rules:
0 --> 001
1 --> 110


110110001 
110110001001001110001001110110110001110110001110110001001001110 


Figure 38.1-B: Computation of the Mephisto Waltz sequence via string substitution. 


lead to series related to the base-3 analogue of the parity. The simplest example may be

  T(y) = (1+Y_0-Y_0^2) (1+Y_1-Y_1^2) (1+Y_2-Y_2^2) ... (1+Y_k-Y_k^2) ...                     (38.1-22a)
       = 1 + y - y^2 + y^3 + y^4 - y^5 - y^6 - y^7 + y^8 + y^9 + y^10 - y^11 + y^12 + ...     (38.1-22b)


The sign of the n-th coefficient is the parity of the number of twos in the radix-3 expansion of n. We 
have 


1 5 5 
TO = te tei eV MV eV REV Ry ty y" +e: (38.1-23) 
  \frac{1}{2} [ 1 - \frac{1}{2} T(\frac{1}{2}) ] = 0.1526445236254075825319249214757916793115045148714892548...    (38.1-24)
  [base 2] = 0.0010011100010011101101100010010011100010011101101100011...
  [CF] = [0,6,1,1,4,2,1,1,2,4,1,1,4,1,4,2,1,1,1,2,1, 18,3, 24,1,6,1,3,...]

The sequence of zeros and ones in the binary expansion is entry A064990 in [312], the Mephisto Waltz
sequence. Its computation via string substitution is shown in figure 38.1-B.
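
The description of the signs (parity of the number of twos in the radix-3 expansion of n) translates directly into code. The following C++ sketch generates the bits of the Mephisto Waltz sequence that way; the function names are chosen for this illustration and are not taken from the FXT library.

```cpp
#include <cassert>
#include <string>

// Bit n of the Mephisto Waltz sequence:
// parity of the number of digits 2 in the radix-3 expansion of n.
inline unsigned mephisto_bit(unsigned long n)
{
    unsigned p = 0;
    while ( n != 0 )
    {
        if ( n % 3 == 2 )  p ^= 1;
        n /= 3;
    }
    return p;
}

// The bits with indices 0, 1, ..., len-1 as a string of '0' and '1'.
inline std::string mephisto_prefix(unsigned long len)
{
    std::string s;
    for (unsigned long n=0; n<len; ++n)  s += char( '0' + mephisto_bit(n) );
    return s;
}
```

The first 27 bits obtained this way agree with the leading bits of the binary expansion given in relation 38.1-24.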


38.2 An iteration related to the Thue constant 


T0 = 0

T1 = 111

T2 = 110110110 
== 3 times 11.0 

T3 = 110110111110110111110110111 
== 3 times 110110.111 

T4 = 110110111110110111110110110110110111110110111110110110110110111110110111110110110 
== 3 times 110110111110110111.110110110 


Figure 38.2-A: Computation of the Thue constant in binary. 


We construct a sequence of zeros and ones that can be generated by starting with a single 0 and repeated
application of the substitution rules 0 --> 111 and 1 --> 110. The evolution starting with a single zero is
shown in figure 38.2-A. The crucial observation is that T_n = U.U.U where U = T'_{n-1}.T_{n-2} and T'_k consists
of the first and second third of T_k. The length of the n-th string is 3^n. Let T(y) be the function whose
power series corresponds to the string T_\infty:

  T(y) = 1 + y + y^3 + y^4 + y^6 + y^7 + y^8 + y^9 + y^10 + y^12 + y^13 + y^15 + y^16 + y^17 + y^18 + ...    (38.2-1)

It can be computed by the iteration

  L_0 = 0,  A_0 = 1 + y,  B_0 = y^2,  Y_0 = y        (38.2-2a)
  R_n = A_n + Y_n^2 L_n                              (38.2-2b)
  L_{n+1} = A_n + B_n                                (38.2-2c)
  Y_{n+1} = Y_n^3                                    (38.2-2d)
  A_{n+1} = R_n (1 + Y_{n+1})   -->  T(y)            (38.2-2e)
  B_{n+1} = R_n Y_{n+1}^2                            (38.2-2f)
Hid = did (38.2-2f) 
The implementation is slightly tricky:

th(y, N=5)=
{
    local(L, R, A, B, y2, y3, t);
    /* correct up to order 3^(N+1)-1 */
    L = 0;
    A = 1+y;  B = y^2;  /* R = A.B */
    for(k=1, N,
        /* (L, A.B) --> (A.B, A.L.A.L . A.L) */
        y2 = y^2;
        R = A + y2*L;  /* A.L */
        L = A + B;  /* next L = A.B */
        y3 = y * y2;
        B = R * (y3*y3);
        A = R * (1+y3);  /* next A = A.L.A.L */
        y = y3;
    );
    return( A + B )
}

The Thue constant (which should be called Roth's constant, see entries A014578 and A074071 in [312])
can be computed as

  \frac{1}{2} T(\frac{1}{2}) = 0.8590997968547031049035725028419742026142399555594390874...    (38.2-3)
[base 2] = 0.110, 110,111,110, 110, 111, 110, 110, 110, 110, 110, 111, 110, 110, 111, 110, 110, ... 
[base 8] = 0.667, 667, 666, 667, 667, 666, 667, 667, 667, 667, 667, 666, 667, 667, 666, 667, 667, ... 


[CF] = 0,1,6,10,3,2,513,1,1,2, 1,4, 2,6576668769, 1, 1, 4, 
1,2, 2, 256, 1, 1, 2, 1, 2,3, 1,3, 3, 2417851639229258349412353, 
1, 2,3, 1,3, 2, 1,2, 1,1, 256, 2, 2, 1,4, 2, 3288334384, 
1, 1,4, 1,9) 2, 146, 2,9, 3:9, 1, 9/1, 19,3, ..] 


The term X in the continued fraction has 74 decimal digits. By construction the bits at positions n not
divisible by 3 are one and otherwise the complement of the bit at position n/3. As a functional equation
(see also section 38.5 on page 734):

  y T(y) + y^3 T(y^3) = \frac{y}{1-y}    (38.2-4)

From this relation we can obtain a series for T(y):

  T(y) = \frac{1}{y} \sum_{n=0}^{\infty} (-1)^n \frac{y^{3^n}}{1 - y^{3^n}}    (38.2-5)
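
The rule for the bits (one at positions not divisible by 3, otherwise the complement of the bit at position n/3) gives a direct way to compute any single bit. The following C++ sketch, with function names chosen for this illustration, counts the factors 3 of the position n >= 1.

```cpp
#include <cassert>
#include <string>

// Bit at position n >= 1 of the binary expansion of the Thue constant:
// one if 3 does not divide n, else the complement of the bit at position n/3.
// Position 0 does not exist, zero is returned for it.
inline unsigned thue_bit(unsigned long n)
{
    if ( n == 0 )  return 0;
    unsigned b = 1;
    while ( n % 3 == 0 )  { n /= 3;  b ^= 1; }  // one complement per factor 3
    return b;
}

// The bits at positions 1, 2, ..., len as a string of '0' and '1'.
inline std::string thue_prefix(unsigned long len)
{
    std::string s;
    for (unsigned long n=1; n<=len; ++n)  s += char( '0' + thue_bit(n) );
    return s;
}
```

The bit is zero exactly if the exponent of 3 in n is odd, which is the same alternation that produces the signs in the series of relation 38.2-5.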


38.3 An iteration related to the Golay-Rudin-Shapiro sequence 


We define the function Q(y) by the iteration

  L_0 = 1,  R_0 = y,  Y_0 = y                (38.3-1a)
  L_{n+1} = L_n + R_n   -->  Q(y)            (38.3-1b)
  Y_{n+1} = Y_n^2                            (38.3-1c)
  R_{n+1} = Y_{n+1} (L_n - R_n)              (38.3-1d)

The power series for Q(y) is

  Q(y) = 1 + y + y^2 - y^3 + y^4 + y^5 - y^6 + y^7 + y^8 + y^9 + y^10 - y^11 - y^12 - y^13 + y^14 - y^15 + ...    (38.3-2)
Number of symbols = 4

Start: e
Rules:
e --> ed
d --> e2
2 --> 1d
1 --> 12

ede2
ede2ed1d
ede2ed1dede212e2
ede2ed1dede212e2ede2ed1d121ded1d
ede2ed1dede212e2ede2ed1d121ded1dede2ed1dede212e2121d12e2ede212e2 ...

Figure 38.3-A: Computation of the GRS constant in hexadecimal.


The sequence of coefficients is the Golay-Rudin-Shapiro sequence (or GRS sequence, entry A020985 in
[312]), see also section 1.16.5 on page 44.
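
The iteration for Q(y) amounts to a doubling step on the lists of coefficients: the new L is the concatenation L.R and the new R is L.(-R). The following C++ sketch carries this out on vectors of signs and compares the result with a well-known closed form for the GRS sequence (the sign is determined by the parity of the number of pairs of adjacent ones in the binary expansion of n). The closed form is not stated in the text above and both function names are chosen for this illustration.

```cpp
#include <cassert>
#include <vector>

// GRS sequence via the closed form: (-1)^(number of adjacent pairs of ones in binary n).
inline int grs(unsigned long n)
{
    unsigned long x = n & (n >> 1);  // one set bit per pair of adjacent ones
    int s = +1;
    while ( x != 0 )  { s = -s;  x &= (x-1); }
    return s;
}

// Coefficients of Q(y) up to order 2^steps - 1 by the doubling (L, R) --> (L.R, L.(-R)).
// R is stored without its leading power of y.
inline std::vector<int> grs_by_doubling(unsigned steps)
{
    std::vector<int> L(1, +1),  R(1, +1);  // L_0 = 1,  R_0 = y
    for (unsigned k=0; k<steps; ++k)
    {
        std::vector<int> Ln(L),  Rn(L);
        for (int c : R)  { Ln.push_back(+c);  Rn.push_back(-c); }
        L.swap(Ln);  R.swap(Rn);
    }
    return L;
}
```

Each doubling step costs time proportional to the current length, so the first 2^n coefficients are obtained with about 2^n operations.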


We define the Golay-Rudin-Shapiro constant (or GRS-constant)

  Q = 0.9292438695973788532539766447220507644128755395243255222...    (38.3-3)
  [base 2] = 0.1110, 1101, 1110, 0010, 1110, 1101, 0001, 1101, 1110, 1101, 1110, 0010, ...
  [base 16] = 0.ede2, ed1d, ede2, 12e2, ede2, ed1d, 121d, ed1d, ede2, ed1d, ede2, 12e2,...
  [CF] = [0,1,13,7,1,1,15,4,1,3,1,2, 2, 1000, 12, 2,1,6,1,1,1,1,1,1,8,2,1,1,2,4,1,1,3,..]

as the evaluation

  Q = \frac{1}{2} Q(\frac{1}{2})    (38.3-4)


An implementation using GP is

qq(y, N=8)=
{
    local(L, R, Lp, Rp);
    /* correct up to order 2^(N+1) */
    L=1;  R=y;
    for(k=0,N,  Lp=L+R;  y*=y;  Rp=y*(L-R);  L=Lp;  R=Rp);
    return( L + R )
}


The hexadecimal expansion can also be computed with a string substitution shown in figure 38.3-A (which
was created with the program [FXT: ds/stringsubst-demo.cc]).


The following functional equations hold for Q:

  Q(y^2) = \frac{1}{2} [ Q(y) + Q(-y) ]          (38.3-5a)
  Q(y)   = Q(y^2) + y Q(-y^2)                    (38.3-5b)
  Q(-y)  = Q(y^2) - y Q(-y^2)                    (38.3-5c)

Combining the latter two relations gives

  Q(y) = (1 + y) Q(y^4) + (y^2 - y^3) Q(-y^4)    (38.3-6)

Michael Somos [priv. comm.] gives

  (38.3-7)
Counting zeros and ones in the binary expansion of Q


The number of ones and zeros in the first $4 \cdot 2^k$ bits of the constant Q can be computed as follows:


/* e --> ed ;  d --> e2 ;  2 --> 1d ;  1 --> 12 */
/* e d ;  e 2 ;  d 1 ;  2 1 ; */
mg= [1, 1, 0, 0;  1, 0, 1, 0;  0, 1, 0, 1;  0, 0, 1, 1];
mg=mattranspose(mg)
{ for (k=0, 40,
    print1( k, ": ");
    mm=mg^k;
    mv = mm*[1,0,0,0]~;
    t = sum(i=1,4, mv[i]);
    /* e and d have three ones and one zero */
    /* 1 and 2 have one one and three zeros */
    n0 = 3*(mv[3]+mv[4]) + (mv[1]+mv[2]);  /* # of zeros */
    n1 = 3*(mv[1]+mv[2]) + (mv[3]+mv[4]);  /* # of ones */
    print( t, " ", mv~,
      " #0=", n0, " #1=", n1, " diff=", n1-n0, " #1/#0=", 1.0*n1/n0 );
) }

  k:    t   [ #e,  #d,  #2,  #1]
  0:    1   [  1,   0,   0,   0]  #0=   1  #1=   3  diff= 2  #1/#0=3.0000000
  1:    2   [  1,   1,   0,   0]  #0=   2  #1=   6  diff= 4  #1/#0=3.0000000
  2:    4   [  2,   1,   1,   0]  #0=   6  #1=  10  diff= 4  #1/#0=1.6666666
  3:    8   [  3,   3,   1,   1]  #0=  12  #1=  20  diff= 8  #1/#0=1.6666666
  4:   16   [  6,   4,   4,   2]  #0=  28  #1=  36  diff= 8  #1/#0=1.2857142
  5:   32   [ 10,  10,   6,   6]  #0=  56  #1=  72  diff=16  #1/#0=1.2857142
  6:   64   [ 20,  16,  16,  12]  #0= 120  #1= 136  diff=16  #1/#0=1.1333333
  7:  128   [ 36,  36,  28,  28]  #0= 240  #1= 272  diff=32  #1/#0=1.1333333
  8:  256   [ 72,  64,  64,  56]  #0= 496  #1= 528  diff=32  #1/#0=1.0645161
  9:  512   [136, 136, 120, 120]  #0= 992  #1=1056  diff=64  #1/#0=1.0645161
 10: 1024   [272, 256, 256, 240]  #0=2016  #1=2080  diff=64  #1/#0=1.0317460

 [--snip--]
 40: 1099511627776  [274878431232, 274877906944, 274877906944, 274877382656] \
      #0=2199022206976  #1=2199024304128  diff=2097152  #1/#0=1.00000095367


Figure 38.3-B: Number of symbols, zeros and ones with the n-th step of the string substitution engine 
for the GRS sequence. For long strings the ratio of the number of zeros and ones approaches one. 


The data is shown in figure 38.3-B. The sequence of the numbers of ones is entry A005418 in [312]. It
is identical to the sequence of numbers of equivalence classes obtained by identifying bit-strings that are
mutual reverses or complements, see section 3.5.2.5 on page 151.


38.4 Iteration related to the ruler function 


Figure 38.4-A: Computation of the power series of the ruler function via a (generalized) string
substitution engine.


The ruler function r(n) is defined to be the highest exponent e so that 2^e divides n. Here we consider
the function that equals r(n) + 1 for $n \neq 0$ and zero for n = 0. The partial sequences up to indices 2^n - 1
are shown in figure 38.4-A. Observe that R_n = R_{n-1}.(R_{n-1} + [n,0,0,...,0]). The limiting sequence is
entry A001511 in [312]. Define the function R(y) as the limit of the iteration

  R_1 = y,  Y_1 = y                                      (38.4-1a)
  Y_{n+1} = Y_n^2                                        (38.4-1b)
  R_{n+1} = R_n + Y_{n+1} [R_n + (n+1)]   -->  R(y)      (38.4-1c)


Implementation in GP: 


r2(y, N=11)=
{ /* correct to order = 2^N-1 */
    local(A);  A=y;
    for(k=2, N,  y *= y;  A += y*(A + k); );
    return( A );
}

To compute ry replace the statement A += y*(A + k); by A += y*(A + 1); For the function R we
have

  R(1/q^2) = \frac{1}{2q} [ (q-1) R(1/q) + (q+1) R(-1/q) ]                   (38.4-2a)
  R(1/q)   = R(1/q^2) + \frac{1}{q-1}                                        (38.4-2b)
  R(y)     = R(y^2) + \frac{y}{1-y}                                          (38.4-2c)

and so

  R(y) = \sum_{n=0}^{\infty} \frac{y^{2^n}}{1 - y^{2^n}}                     (38.4-3)

We further have

  R(y) = \sum_{n=0}^{\infty} (1+n) \frac{y^{2^n}}{1 - y^{2^{n+1}}}           (38.4-4)


Michael Somos [priv. comm.] gives

  R(y) - 3 R(y^2) + 2 R(y^4) = \frac{R(y^2) - R(y^4)}{R(y) - R(y^2)}    (38.4-5)

Define the ruler constant as R := R(1/2)/2, then

  R = 0.7019684139410891602881030370686046772688193807609450337...    (38.4-6)
  [base 2] = 0.10110011101101000011001110110100101100111011010000110011...
  [CF] = [0, 1, 2, 2, 1, 4, 2, 1, 1,1, 2,2, 1,3,3, 4, 5, 6, 1,5,1, 1,9, 49, 1,8,1, 1,5, 1,
          6,5,1,3,3,1,2,4,3,1,2,4,2,1,1,3,1,9, 1,11, 18,2, 4,5, 1,3,2, 25,9,
          2,3,1,2,3,1,9,1,2,8,1,3,4,1,1, 1,1,31, 1,1,6, 1,13, 1,1,14,1,6,1,...]


38.5 An iteration related to the period-doubling sequence 


Start: 0
Rules:
0 --> 11
1 --> 10

T0 = 0
T1 = 11
T2 = 1010
T3 = 10111011
T4 = 1011101010111010
T5 = 10111010101110111011101010111011
T6 = 1011101010111011101110101011101010111 ...


Figure 38.5-A: Computation of the period-doubling sequence via string substitution. 




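Define the function T(y) as

  T(y) = \sum_{n=0}^{\infty} (-1)^n \frac{y^{2^n}}{1 - y^{2^n}} = \sum_{n=0}^{\infty} \frac{y^{2^{2n}}}{1 - y^{2^{2n+1}}}    (38.5-1a)
       = y + y^3 + y^4 + y^5 + y^7 + y^9 + y^11 + y^12 + y^13 + y^15 + y^16 + ...                                            (38.5-1b)

The function can be computed by the iteration

  A_1 = 0,  L_1 = y,  R_1 = y,  Y_1 = y          (38.5-2a)
  A_{n+1} = L_n + Y_n R_n   -->  T(y)            (38.5-2b)
  L_{n+1} = L_n + Y_n A_n                        (38.5-2c)
  R_{n+1} = R_n + Y_n A_n                        (38.5-2d)
  Y_{n+1} = Y_n^2                                (38.5-2e)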


Implementation in GP: 


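t2(y, N=11)=
{ /* correct to order = 2^N-1 */
    local(A, L, R, t);
    A=0;  L=y;  R=y;
    for(k=1, N,
        t = y*A;
        A = L + y*R;   /* A_{n+1} = L_n + Y_n R_n */
        L += t;        /* L_{n+1} = L_n + Y_n A_n */
        R += t;        /* R_{n+1} = R_n + Y_n A_n */
        y *= y;        /* Y_{n+1} = Y_n^2 */
    );
    return( A );
}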


The power series can be computed by starting with a single zero and applying the substitution rules 0 --> 11
and 1 --> 10. The evolution is shown in figure 38.5-A. Observe that T_n = L(T_{n-1}).T_{n-2}.R(T_{n-1}).T_{n-2}
where L and R denote the left and right half of their arguments. The limiting sequence is the period-
doubling sequence. It is entry A035263 in [312] where it is called the first Feigenbaum symbolic sequence.
Define the period-doubling constant as T := T(1/2), then


T = 0.7294270234949484057090662068940526170600269444658547417...  (38.5-3) 
[base 2] =  0.10111010101110111011101010111010101110101011101110111010... 
[CF] = (0, 1, 2, 1, 2, 3, 2,8, 1,1,1,2, 1,8, 6, 1, 2, 1,2,8,1,2,2,1,1, 24, 2, 2, 2, 1, 


649 12.6 1-89 211118 16 2511 8, 
684,21, 2,64, 2,9, 1,1, 94 99,1, 8, 91,9, 1,11, 1,9, 91,116...) 


The transcendence of this constant is proved in [199]. A functional equation for T(y) is 


  T(y) + T(y^2) = \frac{y}{1-y}    (38.5-4)


The power series of T(y?) has coefficients one where the series of T(y) has coefficients zero. Michael 
Somos [priv. comm.] gives 


[ry) -TU = BATUTO] (TW?) - T(/£) (38.5-5) 


38.5.1 Connection to the towers of Hanoi puzzle 


The towers of Hanoi puzzle consists of three piles and n disks of different size. In the initial configuration 
all disks are on the leftmost pile, ordered by size (smallest on top). The task is to move all disks to the 
rightmost pile by moving only one disk at a time and never putting a bigger disk on top of a smaller one. 


The puzzle with n disks can be solved in 2^n - 1 steps. Figure 38.5-B shows the solution for n = 4 [FXT:
bits/hanoi-demo.cc]. Here the piles are represented as binary words. Note that with each move the lowest
bit in one of the three words is moved to another word where it is again the lowest bit.
        pile_0   pile_1   pile_2    moved    summary   direction
                                    disk               of move
   0:    1111     ....     ....               0000
   1:    111.     ....     ...1     ...1      000-        1
   2:    11..     ..1.     ...1     ..1.      00+-        0
   3:    11..     ..11     ....     ...1      00++        1
   4:    1...     ..11     .1..     .1..      0-++        1
   5:    1..1     ..1.     .1..     ...1      0-+0        1
   6:    1..1     ....     .11.     ..1.      0--0        0
   7:    1...     ....     .111     ...1      0---        1
   8:    ....     1...     .111     1...      +---        0
   9:    ....     1..1     .11.     ...1      +--+        1
  10:    ..1.     1..1     .1..     ..1.      +-0+        0
  11:    ..11     1...     .1..     ...1      +-00        1
  12:    ..11     11..     ....     .1..      ++00        1
  13:    ..1.     11..     ...1     ...1      ++0-        1
  14:    ....     111.     ...1     ..1.      +++-        0
  15:    ....     1111     ....     ...1      ++++        1


Figure 38.5-B: Solution of the towers of Hanoi puzzle for 4 disks. The rightmost column corresponds 
to the direction of the move, it is the period-doubling sequence. 


For a simple solution, we observe that the disk moved with step k = 1, ..., 2^n - 1 corresponds to the
lowest set bit in the binary representation of k and the index of the untouched pile changes by +1 mod 3
for n even and -1 mod 3 for n odd. The essential part of the implementation is


void
hanoi(ulong n)
{
    ulong f[3];
    f[0] = first_comb(n);  f[1] = 0;  f[2] = 0;  // Initial configuration

    const int dr = (n&1 ? -1 : +1);  // == +1 (if n even), else == -1

    // PRINT configuration

    int u;  // index of tower untouched in current move
    if ( dr<0 )  u=2;  else  u=1;

    ulong n2 = 1UL<<n;
    for (ulong k=1; k<n2; ++k)
    {
        ulong s = lowest_one(k);

        ulong j = 3;  while ( j-- )  f[j] ^= s;  // change all piles
        f[u] ^= s;  // undo change for untouched pile

        u += dr;
        if ( u<0 )  u=2;  else if ( u>2 )  u=0;  // modulo 3

        // PRINT configuration
    }
}


Now with each step the transferred disk is moved by +1 or —1 position (modulo 3). The rightmost 
column in figure |38.5-B| consists of zeros and ones corresponding to the direction of the move. It is the 
period-doubling sequence. A recursive algorithm for the towers of Hanoi puzzle is [FXT: 


rec-demo.cc 


ulong f[3];  // the three piles

void hanoi(int k, ulong A, ulong B, ulong C)
// Move k disks from pile A to pile C
{
    if ( k==0 )  return;

    // 1. move k-1 disks from pile A to pile B:
    hanoi(k-1, A, C, B);

    // 2. move disk k from pile A to pile C:
    ulong b = 1UL << (k-1);
    f[A] ^= b;
    f[C] ^= b;

    print_hanoi(b);  // visit state

    // 3. move k-1 disks from pile B to pile C:
    hanoi(k-1, B, A, C);
}


The piles are represented by the binary words f[A], f[B], and f[C], the variable k is the number of the
disk moved. The routine is called as follows

    ulong n = 5;

    // Initial configuration:
    f[0] = first_comb(n);  // n ones as lowest bits
    f[1] = 0;  f[2] = 0;  // empty

    // visit initial state

    hanoi(n, 0, 1, 2);  // solve


More about the computation of the moves is given in [342], an extensive bibliography is [325]. 
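
The connection between the moves and the period-doubling sequence can be checked mechanically. The following C++ sketch repeats the non-recursive method from above and records for each move whether the disk went by -dr modulo 3. The helpers first_comb and lowest_one are simple stand-ins written for this sketch (assumed to behave like the routines of the same name used in the text), and period_doubling and hanoi_directions are names chosen here.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long ulong;

// Stand-ins for the helper routines used in the text (assumptions of this sketch):
inline ulong first_comb(ulong n)  { return (1UL << n) - 1; }  // n ones as lowest bits
inline ulong lowest_one(ulong x)  { return x & (0 - x); }     // isolate lowest set bit

// Period-doubling sequence for k >= 1: one if the exponent of 2 in k is even.
inline int period_doubling(ulong k)
{
    int b = 1;
    while ( (k & 1) == 0 )  { k >>= 1;  b ^= 1; }
    return b;
}

// Solve the puzzle with n disks, final piles are left in f[].
// Entry k-1 of the result is 1 if move k shifted the disk by -dr (mod 3), else 0.
inline std::vector<int> hanoi_directions(ulong n, ulong f[3])
{
    f[0] = first_comb(n);  f[1] = 0;  f[2] = 0;
    const int dr = (n&1 ? -1 : +1);
    const int minus_dr = ( dr>0 ? 2 : 1 );  // -dr modulo 3
    int u = ( dr<0 ? 2 : 1 );  // index of pile untouched in current move
    std::vector<int> dir;
    const ulong n2 = 1UL << n;
    for (ulong k=1; k<n2; ++k)
    {
        const ulong s = lowest_one(k);
        int from = -1,  to = -1;
        for (int j=0; j<3; ++j)
        {
            if ( j == u )  continue;
            if ( f[j] & s )  from = j;  else  to = j;
            f[j] ^= s;  // move the disk
        }
        dir.push_back( (to - from + 3) % 3 == minus_dr ? 1 : 0 );
        u += dr;
        if ( u<0 )  u=2;  else if ( u>2 )  u=0;  // modulo 3
    }
    return dir;
}
```

For n = 4 the recorded directions reproduce the rightmost column of figure 38.5-B, and all disks end up on the pile with index 1.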


38.5.2 Generalizations of the period-doubling sequence 


? default(realprecision, 85);
? tm(y,m,N=8)=sum(n=0,N, (-1)^n*(y^(m^n)/(1-y^(m^n))))
? for(m=2,9,print(m,": ",tm(1.0/10,m,4));print("   ",tm(1.0/10^m,m,4)))


2i 1, 1 1.1, 111, 111 114; 1.1. 111; 111, 1144 1.1, 111, 111, 111; 1.1, 1314 111, 111, 1,1, 111, Att, uie 
odd. iil... 101...1.1.1...1.1.1-52:4...1. 4 1:1. ede etd. de 

3: 11 11,11111,11,11111,11,11,11,11,11111,11,11111,11,11,11, 11 11111,11,11111,11.11111,1 
.1..1 i..i 11: 1.41.11 dl a es eras ii i..i i. 


4: 111; 111; 114; 1111111, 1 111 1111111, 111; 111, 1111111, 111; 111; 111, 111; 1114 1 1111111,1 
daas... i...1...1 ii:i i...i. Aid 1: 


Figure 38.5-C: Visual verification of the higher order analogues of the period-doubling sequence. In the 
output lines the leading ‘0.’ was removed and all zeros were replaced by dots. 


The functional equation for the period-doubling sequence, relation 38.5-4, can be generalized in several
ways. For example, one can look for a function for which F_3(y) + F_3(y^3) = y/(1 - y). It is given by

  F_3(y) = \sum_{k=0}^{\infty} (-1)^k \frac{y^{3^k}}{1 - y^{3^k}}                                        (38.5-6a)
         = y + y^2 + y^4 + y^5 + y^7 + y^8 + y^9 + y^10 + y^11 + y^13 + y^14 + y^16 + ...                (38.5-6b)

We can compute the constant

  F_3(1/2) = 0.8590997968547031049035725028419742026142399555594390874...    (38.5-7)


But this is just the Thue constant, see section 38.2. Large terms occur in the continued
fraction expansion of this constant. Even greater terms occur in the continued fractions of F_m(1/2) for
m > 3 (replace 3 by m in relation 38.5-6a). For example, the 45-th term of F_5(1/2) has 565 digits. In
contrast, the greatest of the first 1630 terms of the continued fraction of F_2(1/2) = T(1/2) equals 288.
Some sequences corresponding to the higher order analogues of the period-doubling sequence are shown
in figure 38.5-C.

A different way to generalize is to search functions for which, for example, the following functional 
equation holds: 


  F(y) + F(y^2) + F(y^3) = \frac{y}{1-y}    (38.5-8)

The equation can be solved by writing F(y) = y/(1 - y) - F(y^2) - F(y^3) and using recursions that
terminate when a prescribed order is reached.


F(z, R)=
{ /* solve F(y) + F(y^2) + F(y^3) = y/(1-y) */
    local(s, y);
    y = z + R;
    s = y/(1-y);
    if ( y^2!=R,  s -= F(z^2, R) );
    if ( y^3!=R,  s -= F(z^3, R) );
    return(s);
}


To verify that the function F does satisfy the given functional equation, we show the sequences of
coefficients of the power series of F(y), F(y^2), F(y^3) and their sum:

 F(y)=   [0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,-1, 1, 0, 0, 1, 1,-1, 1, 1, 0, 0, 1, 2, 1, ...]
 F(y^2)= [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,-1, 0, ...]
 F(y^3)= [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...]
 sum=    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]

  F(y) = y + y^4 + y^5 + y^6 + y^7 + y^9 + y^11 - y^12 + y^13 + y^16 + y^17 - y^18 + y^19 + y^20 + y^23 + 2y^24 + ...
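
Comparing coefficients in F(y) + F(y^2) + F(y^3) = y/(1-y) gives a recursion for the coefficients f(n) of F(y): f(n) = 1 - f(n/2) - f(n/3) for n >= 1, where a term is omitted if its index is not an integer. The following C++ sketch (the function name is chosen for this illustration) computes the coefficients this way, as an alternative to the recursion on power series in the GP routine.

```cpp
#include <cassert>
#include <vector>

// Coefficients f(0), f(1), ..., f(len-1) of F(y) where F(y) + F(y^2) + F(y^3) = y/(1-y).
inline std::vector<long> coeffs_F(unsigned long len)
{
    std::vector<long> f(len, 0);  // f(0) = 0
    for (unsigned long n=1; n<len; ++n)
    {
        long c = 1;                        // coefficient of y^n in y/(1-y)
        if ( n % 2 == 0 )  c -= f[n/2];    // contribution of F(y^2)
        if ( n % 3 == 0 )  c -= f[n/3];    // contribution of F(y^3)
        f[n] = c;
    }
    return f;
}
```

Each coefficient needs at most two earlier values, so the first N coefficients are obtained in time proportional to N.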


 F(y)=   .1..11.1.1.1.1..11.11..1.1..11.1...111...1.111.1.1..11.1...1.1.111.11..1.1
 F(y^2)= ..1.....1.1...1...1...1...1.....1.1...1.1.....1...1.....1.1...1.......1.1.
 F(y^3)= ...1........1..1.....1.....1.....1.....1........1..1.....1..1........1....
 F(y^6)= ......1.................1.....1...........1...........1...........1.......
 sum=    .1111111111111111111111111111111111111111111111111111111111111111111111111


Figure 38.5-D: Visual demonstration that F(y) satisfies the functional equation F(y) + F(y^2) + F(y^3) +
F(y^6) = y/(1 - y). Dots are used for zeros.


The power series of F(y) where F(y) + F(y^2) + F(y^3) + F(y^6) = y/(1 - y) contains only ones and zeros.
We have

  F(y) = \sum_{k=1}^{\infty} R(k) \frac{y^k}{1 - y^k}   where   R(k) = (-1)^{e_2+e_3}  if k = 2^{e_2} 3^{e_3},
                                                                R(k) = 0               otherwise             (38.5-9)

This is a Lambert series, it can be converted into a power series by relation 37.1-25 on page 708. It turns
out that with k = u 2^{e_2} 3^{e_3} (where neither 2 nor 3 divides u), we have

  [y^k] F(y) = 0   if either of e_2 or e_3 is odd
             = 1   otherwise (i.e. both e_2 and e_3 are even)    (38.5-10)


38.6 An iteration from substitution rules with sign 


Start: 1 
Rules: 
O --> L1 
1 --> 10 
L --> LO 
D0 = 1
D1 = 10
D2 = 10L1 
D3 = 10L1L010 
D4 = 10L1L010L01L10L1 
D5 = 10L1LO10LO1L10L1LO1L10LO10L1L010 


D --» 10L1LO10LO1L10L1LO1L10LO10L1LO10LO1L10LO10L1LO1L10L1LO10LO1L... 


Figure 38.6-A: String substitution used to define the function D(y). 


Let D be the fixed point (limit) of the string substitution shown in figure 38.6-A. Now identify L with
-1 and observe that for n > 1 we have D_n = D_{n-1}.(-D_{n-2}).D_{n-2}. Define D(y) by the iteration

  L_0 = 1,  R_0 = 1 (+0 y),  Y_0 = y                          (38.6-1a)
  L_{n+1} = R_n                                               (38.6-1b)
  R_{n+1} = R_n + Y_n^2 (-L_n + Y_n L_n)   -->  D(y)          (38.6-1c)
  Y_{n+1} = Y_n^2                                             (38.6-1d)

Implementation in GP:

dd(y, N=7)=
{
    local(R, L, y2, t);
    /* correct up to order 2^(N) */
    L = 1;
    R = 1 + 0*y;
    for(k=1, N,
        /* (L, R) --> (R, R.(-L).L) */
        y2 = y^2;
        t = R;
        R = R + y2*(-L + y*L);  /* R.(-L).L */
        L = t;
        y = y2;
    );
    return( R )
}

The power series is

  D(y) = 1 - y^2 + y^3 - y^4 + y^6 - y^8 + y^10 - y^11 + y^12 - y^14 + y^15 - ...    (38.6-2)

The coefficients are sequence A029883 in [312] where the following functional equation is given: set
T_1 = y D(y), T_2 = y^2 D(y^2), and T_4 = y^4 D(y^4), then

  T_2 - T_4 - T_1^2 - T_2^2 + 2 T_1 T_4 = 0    (38.6-3)

We further have (see relation 38.1-4 on page 727)

  \frac{y}{1-y} D(y) = K(y) = y + y^2 + y^4 + y^7 + y^8 + y^11 + y^13 + y^14 + ...    (38.6-4)

The coefficients are 1 where the Thue-Morse sequence equals -1. Thus the parity number can be
computed as P = \frac{1}{2} D(\frac{1}{2}). A functional equation for D is

  D(y) = y \frac{1-y}{1+y} D(y^2) + \frac{1}{1+y}    (38.6-5)


38.7 Iterations related to the sum of digits


S0 = 0
S1 = 01
S2 = 0112
S3 = 01121223
S4 = 0112122312232334
S --> 01121223122323341223233423343445...

Figure 38.7-A: Generalized string substitution leading to the 1's-counting sequence.


The sequence of the sum of binary digits of the natural numbers starting with zero can be constructed
as shown in figure 38.7-A. The sequence (entry A000120 in [312]) is called the 1's-counting sequence.
Observe that S_n = S_{n-1}.(S_{n-1} + I_{n-1}) where I_n is a sequence of 2^n ones and addition is element-wise.
Define the function S(y) by

  I_1 = 1,  A_1 = y,  Y_1 = y                                (38.7-1a)
  I_{n+1} = \sum_{k=0}^{2^n-1} y^k                           (38.7-1b)
  Y_{n+1} = Y_n^2                                            (38.7-1c)
  A_{n+1} = A_n + Y_{n+1} (I_{n+1} + A_n)   -->  S(y)        (38.7-1d)

Implementation in GP:

s2(y, N=7)=
{
    local(in, A);
    /* correct to order = 2^N-1 */
    in = 1;  /* 1+y+y^2+y^3+...+y^(2^k-1) */
    A = y;
    for(k=2, N,
        in *= (1+y);
        y *= y;
        A += y*(in + A);
    );
    return( A );
}
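
On the level of coefficient lists each step of the iteration appends to the list a copy of itself with all entries incremented. The following C++ sketch (the function name is chosen for this illustration) performs this doubling directly on a vector of integers.

```cpp
#include <cassert>
#include <vector>

// The first 2^steps terms of the 1's-counting sequence by repeated doubling:
// S --> S.(S+1) where the addition is element-wise.
inline std::vector<unsigned> ones_counting(unsigned steps)
{
    std::vector<unsigned> s(1, 0);  // S_0 = [0]
    for (unsigned k=0; k<steps; ++k)
    {
        const std::size_t m = s.size();
        s.reserve(2*m);
        for (std::size_t j=0; j<m; ++j)  s.push_back( s[j] + 1 );
    }
    return s;
}
```

After four steps the list is 0112122312232334, the string S4 of figure 38.7-A.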
The power series is

  S(y) = 0 + y + y^2 + 2y^3 + y^4 + 2y^5 + 2y^6 + 3y^7 + y^8 + 2y^9 + 2y^10 + 3y^11 + 2y^12 + 3y^13 + ...    (38.7-2)

Define the sum-of-digits constant as S := S(1/2)/2, then
S =  0.5960631721178216794237939258627906454623612384781099326 . . . (38.7-3) 
[base 2] =  0.100110001001011110011000100101101001100010010111100110001 ... 
[CF] = [0,1,1,2,9,1,,3,5, 5,2, 1,1,1, 1,8,2, 1, 1,2, 1, 12, 19,24, 1, 18, 12, 1, . ..] 
The sequence of decimal digits is entry A051158 in [312]. We have (see [324]) 

  S(y) = \frac{1}{1-y} \sum_{k=0}^{\infty} \frac{y^{2^k}}{1 + y^{2^k}}                                    (38.7-4)

and also

  S(y) = \sum_{k=0}^{\infty} \frac{y^{2^k}}{1 - y^{2^{k+1}}} \prod_{j=0}^{k-1} [ 1 + y^{2^j} ]            (38.7-5)

The last relation follows from the functional equation for S,

  S(y) = (1 + y) S(y^2) + \frac{y}{1 - y^2}                                                               (38.7-6)

It is of the form F(y) = A(y) F(y^2) + B(y) where A(y) = 1 + y and B(y) = y/(1 - y^2) and has the
solution

  F(y) = \sum_{k=0}^{\infty} B(y^{2^k}) \prod_{j=0}^{k-1} A(y^{2^j})                                      (38.7-7)

This can be seen by applying the functional equation several times:

  F(y) = A(y) F(y^2) + B(y)                                                         (38.7-8a)
       = A(y) [A(y^2) F(y^4) + B(y^2)] + B(y)                                       (38.7-8b)
       = A(y) [A(y^2) [A(y^4) F(y^8) + B(y^4)] + B(y^2)] + B(y) = ...               (38.7-8c)
       = B(y) + A(y) B(y^2) + A(y) A(y^2) B(y^4) + ...                              (38.7-8d)

Weighted sum of digits

Define W(y) by

  I_1 = 1/2,  A_1 = y/2,  Y_1 = y                                                          (38.7-9a)
  I_{n+1} = \frac{1}{2} I_n (1 + Y_n) = \frac{1}{2^{n+1}} \sum_{k=0}^{2^n-1} y^k           (38.7-9b)
  Y_{n+1} = Y_n^2                                                                          (38.7-9c)
  A_{n+1} = A_n + Y_{n+1} (I_{n+1} + A_n)   -->  W(y)                                      (38.7-9d)


Implementation in GP: 


w2(y, N=7)=
{
    local(in, y2, A);
    /* correct to order = 2^N-1 */
    in = 1/2;  /* 1/2^k * (1+y+y^2+y^3+...+y^(2^k-1)) */
    A = y/2;
    for(k=2, N,
        in *= (1+y)/2;
        y *= y;
        A += y*(in + A);
    );
    return( A );
}


In the power series

  W(y) = 0 + 1/2 y + 1/4 y^2 + 3/4 y^3 + 1/8 y^4 + 5/8 y^5 + 3/8 y^6 + 7/8 y^7 +                            (38.7-10)
         + 1/16 y^8 + 9/16 y^9 + 5/16 y^10 + 13/16 y^11 + 3/16 y^12 + 11/16 y^13 + 7/16 y^14 + 15/16 y^15 +
         + 1/32 y^16 + 17/32 y^17 + 9/32 y^18 + 25/32 y^19 + 5/32 y^20 + 21/32 y^21 + 13/32 y^22 + 29/32 y^23 +
         + 3/32 y^24 + 19/32 y^25 + 11/32 y^26 + 27/32 y^27 + 7/32 y^28 + 23/32 y^29 + 15/32 y^30 + 31/32 y^31 +
         + 1/64 y^32 + ...

the coefficient of y^n is the weighted sum of digits $w(n) = \sum_{i=0}^{\infty} 2^{-(i+1)} b_i$ where b_0, b_1, ... is the base-2
representation of n. The numerator in the n-th coefficient is the reversed binary expansion of n.
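
The statement about the numerators can be checked with a few lines of code. The following C++ sketch computes w(n) from the binary digits of n and, separately, the bit-reversal of n taken over the length of n (the number of its binary digits); both function names are chosen for this illustration. All values involved are exactly representable in double precision for the small n used here.

```cpp
#include <cassert>

// Weighted sum of digits w(n) = sum_i b_i / 2^(i+1) where b_0, b_1, ... are the binary digits of n.
inline double weighted_digit_sum(unsigned long n)
{
    double w = 0.0,  p = 0.5;
    while ( n != 0 )
    {
        if ( n & 1 )  w += p;
        p *= 0.5;
        n >>= 1;
    }
    return w;
}

// Reverse the binary digits of n within the length of n, e.g. 110 --> 011.
inline unsigned long reversed_bits(unsigned long n)
{
    unsigned long r = 0;
    while ( n != 0 )  { r = (r << 1) | (n & 1);  n >>= 1; }
    return r;
}
```

For example, n = 6 (binary 110) gives w(6) = 3/8, and 3 is binary 011, the reversal of 110.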


The corresponding weighted sum-of-digits constant or revbin constant is W := W (1/2). Then 


W = 0.4485265506762723789236877212545260976162788135384481336...  (38.7-11) 
[base 2] =  0.0111001011010010101000101101001010001010110100101010001 ... 
[CF] = [0,9,2,2.1,4,18, 1,9,6,5,17,2,14, 1,1, 1,2, 1, 1,9, 1,3, 1, 29,4, 1,...] 


For the function W we have the following functional equation:

  W(1/q) = \frac{1}{2} [ \frac{1+q}{q} W(1/q^2) + \frac{q}{q^2-1} ]    (38.7-12)

38.8 Iterations related to the binary Gray code 


38.8.1 Series where the coefficients are the Gray code of exponents 


We construct a function with power series coefficients that are the binary Gray code of the exponent of 
y. A list of the Gray codes is given below (see section |1.16 on page 41): 


    k:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
 g(k):  0  1  3  2  6  7  5  4 12 13 15 14 10 11  9  8 24 25 27 26 30 31 29 28 20 21 23 22 18 19 17 16
The sequence of Gray codes is entry A003188 in [312]. Define the function G(y) as the limit of the
iteration

  F_1 = y,  B_1 = 1,  Y_1 = y,  I_1 = 1 + y                                  (38.8-1a)
  Y_n = Y_{n-1}^2                                                            (38.8-1b)
  F_n = (F_{n-1}) + Y_n (B_{n-1} + 2^{n-1} I_{n-1})   -->  G(y)              (38.8-1c)
  B_n = (F_{n-1} + 2^{n-1} I_{n-1}) + Y_n (B_{n-1})                          (38.8-1d)
  I_n = I_{n-1} (1 + Y_n)                                                    (38.8-1e)

Implementation in GP: 
gg(y, N=15)=
{
    local(t, ii, F, B, Fp, Bp);
    /* correct up to order 2^N-1 */
    F=0+y;  B=1+0;  ii=1+y;
    for(k=2, N,
        y *= y;
        ii *= 2;  /* remove line for sum of digits */
        Fp = (F     ) + y * (B + ii);
        Bp = (F + ii) + y * (B     );
        F = Fp;  B = Bp;
        ii *= (1+y);
    );
    return( F )
}

In the algorithm F contains the approximation so far and B contains the reversed polynomial: 


---k=2:
 (2y^3+3y^2+y)
 (y^2+3y+2)
---k=3:
 (4y^7+5y^6+7y^5+6y^4+2y^3+3y^2+y)
 (y^6+3y^5+2y^4+6y^3+7y^2+5y+4)
---k=4:
 (8y^15+9y^14+11y^13+10y^12+14y^11+15y^10+13y^9+12y^8+4y^7+5y^6+7y^5+6y^4+2y^3+3y^2+y)
 (y^14+3y^13+2y^12+6y^11+7y^10+5y^9+4y^8+12y^7+13y^6+15y^5+14y^4+10y^3+11y^2+9y+8)


We obtain the series 


  G(y) = 0 + 1y + 3y^2 + 2y^3 + 6y^4 + 7y^5 + 5y^6 + 4y^7 +                                   (38.8-2)
         + 12y^8 + 13y^9 + 15y^10 + 14y^11 + 10y^12 + 11y^13 + 9y^14 + 8y^15 +
         + 24y^16 + 25y^17 + 27y^18 + 26y^19 + 30y^20 + 31y^21 + 29y^22 + 28y^23 +
         + 20y^24 + 21y^25 + 23y^26 + 22y^27 + 18y^28 + 19y^29 + 17y^30 + 16y^31 + ...
We define the Gray code constant as G := G(1/2): 
G = 2.302218287787689301229333006391310761000431077704369505... (38.8-3) 
[base 2] =  10.01001101010111100010110101111110010011010001111000101 ... 


[CF] = [2,3,3,4,4,1,4,4, 5,2, 1,1, 1,2,24, 205, 1,4,2,2,1, 1,4, 10,8, 1,9, 1, .. .] 




For the function G we have 


Q 
y aam E amm 
NIZA 

lI 


2 


1 u 1 q 1 q 1 
al) i : ze) | 6 ( 3J (38.8-4a) 
1 B 2(q+1) 1 q? 
= G) i q E (=) *(-1)( 1 (38.8-4b) 
= 2 y 
Gy) = 204900) + aan (38.8-4c) 
1 _ 2 (q — 1) 1 e 
il e i q x E (a+) (g^ +1) (38.8-4d) 
1| _ 2(7 +1) Y. @(@+4+4¢q4+1) »" 
1 
q 


» E +1) a srt 49+D € ) vp a ( J Gsi 


The function G(y) can be expressed as

  G(y) = \frac{1}{1-y} \sum_{k=0}^{\infty} 2^k \frac{y^{2^k}}{1 + y^{2^{k+1}}}    (38.8-5)


38.8.2 Differences of the Gray code 


We define F(y) = (1 - y) G(y) to obtain the power series whose coefficients are the successive differences
of the Gray code. The coefficients are powers of 2 in magnitude:

  F(y) = 0 + y + 2y^2 - y^3 + 4y^4 + y^5 - 2y^6 - y^7 + 8y^8 + y^9 + 2y^10 - y^11 - 4y^12 + y^13 - ...    (38.8-6)

We have

  F(y) = 2 F(y^2) + \frac{y}{1 + y^2}    (38.8-7)

Now, as y/(1 + y^2) = q/(1 + q^2) for q = 1/y,

  F(y) = F(1/y) = \sum_{k=0}^{\infty} 2^k \frac{y^{2^k}}{1 + y^{2^{k+1}}}    (38.8-8)

Thus F(y) can be computed everywhere except on the unit circle. The sum

  \sum_{k=0}^{\infty} 2^k \frac{y^{2^k}}{1 - y^{2^{k+1}}}    (38.8-9)

leads to a series with coefficients
  0 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 32 1 ...


corresponding to the (exponential) version of the ruler function which is defined as the highest power of 


2 that divides n. The ruler function (see section |38.4 on page 733) 


.0102010301020104010201030102010501... 


is the base-2 logarithm of that series. 
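
The differences of the Gray code can be predicted without computing any Gray code. Expanding the terms y^(2^k)/(1+y^(2^(k+1))) into geometric series suggests that the n-th difference has magnitude equal to the highest power of 2 dividing n, and that its sign is positive exactly if the odd part of n is 1 modulo 4. The following C++ sketch (function names chosen for this illustration) implements that prediction next to the usual formula for the Gray code so that both can be compared.

```cpp
#include <cassert>

// Binary Gray code of k.
inline unsigned long gray_code(unsigned long k)  { return k ^ (k >> 1); }

// Predicted difference g(n) - g(n-1) for n >= 1:
// magnitude is the highest power of 2 dividing n,
// sign is + if the odd part of n is 1 mod 4, else -.
inline long gray_difference(unsigned long n)
{
    long m = 1;
    while ( (n & 1) == 0 )  { n >>= 1;  m <<= 1; }
    return ( (n & 3) == 1 ? +m : -m );
}
```

The magnitudes are the terms of the exponential version of the ruler function shown above.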
38.8.3 Sum of Gray code digits


The sequence of the sum of digits of the Gray code of k >= 0 is (entry A005811 in [312]):
  0121232123432321234345432343232123...

Omit the factor 2^{n-1} in relations 38.8-1c and 38.8-1d on page 742. That is, in the implementation simply
remove the line

    ii *= 2;  /* remove line for sum of digits */


Let R(y) be the corresponding function and define the sum of Gray code digits constant as R := R(1/2)/2,
then


R = 0.7014723764037345207355955210641332088227989861654212954...  (38.8-10) 
[base 2] =  0.1011001110010011101100011001001110110011100100011011000... 
[CF] = [0,1,2,2,1,6,10,1,9,53, 1, 1,3, 10, 1,2, 1,3,2, 14, 2, 1,2, 1, 3,4,2, 


1,34, 1, 1,3, 1, 1, 109, 1, 1,4, 2,9,1,642,51,4,3,2,2,2,2,1,2,3,.. ] 


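One finds:

  R(1/q^2) = \frac{1}{2} [ \frac{q}{q+1} R(1/q) + \frac{q}{q-1} R(-1/q) ]          (38.8-11a)
  R(1/q)   = \frac{q+1}{q} R(1/q^2) + \frac{q^2}{q^3 - q^2 + q - 1}                (38.8-11b)
  R(-1/q)  = \frac{q-1}{q} R(1/q^2) - \frac{q^2}{q^3 + q^2 + q + 1}                (38.8-11c)

  R(y) = \frac{1}{1-y} \sum_{k=0}^{\infty} \frac{y^{2^k}}{1 + y^{2^{k+1}}}         (38.8-12)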


Define the paper-folding constant P as P = (R + 1)/2 


P = 0.8507361882018672603677977605320666044113994930827106477...  (38.8-13) 
[base 2] =  0.1101100111001001110110001100100111011001110010001101100... 
[CF] = /[0,1,5,1,2,321,1,4,107,7,5,2, 12,1, 1,2, 1,6,1,2, 6,1, 1,8, 1, 


17, 9, 1,4, 3, 1,54, 9, 1,1,1,2, 1, 4,2,321, 102, 2,6, 1,4,1,5, 2,...] 


The decimal expansion is sequence A143347 in [312]. The following relation is given in [140, p.440]:

  P = \sum_{k=0}^{\infty} \frac{1}{2^{2^k}} ( 1 - \frac{1}{2^{2^{k+2}}} )^{-1}    (38.8-14)

The binary expansion of P is the paper-folding sequence, entry A014577 in [312]. A bit-level algorithm
to generate the sequence is given in section 1.31.3 on page 88.


38.8.4 Differences of the sum of Gray code digits 
Now define E(y) = (1 - y) R(y) to obtain the differences of the sum of Gray code digits. From this
definition and relation 38.8-12 we see that

  E(y) = \sum_{k=0}^{\infty} \frac{y^{2^k}}{1 + y^{2^{k+1}}}    (38.8-15)


 L = 0
 R = +
 L = 0+
 R = +-
 L = 0++-
 R = ++--
 L = 0++-++--
 R = +++--+--
 L = 0++-++--+++--+--
 R = +++-++---++--+--
 L = 0++-++--+++--+--+++-++---++--+--
 R = +++-++--+++--+---++-++---++--+--
 L = 0++-++--+++--+--+++-++---++--+--+++-++--+++--+---++-++---++--+--

Figure 38.8-A: String substitution for sum of Gray code digits.
All power series coefficients except for the constant term are $\pm 1$:

  E(y) = 0 + y + y^2 - y^3 + y^4 + y^5 - y^6 - y^7 + y^8 + y^9 + y^10 - y^11 - y^12 + y^13 - ...    (38.8-16)

We have

  E(1/q) = E(1/q^2) + \frac{q}{q^2 + 1}    (38.8-17a)
  E(y)   = E(y^2) + \frac{y}{1 + y^2}      (38.8-17b)

(use \frac{y}{1+y^2} = \frac{q}{q^2+1} where q = 1/y for the latter relation), thereby

  E(y) = E(1/y)    (38.8-18)


  L_0 = 0,  R_0 = 1,  Y_0 = y
  L_{n+1} = L_n + Y_n R_n   -->  E(y)
  R_{n+1} = (L_n + 1) + Y_n (R_n - 2)
  Y_{n+1} = Y_n^2


Implementation in GP: 
ge(y, N=7)=
{
    local(L, R, Lp, Rp);
    /* correct up to order 2^N-1 */
    L=0;  R=1;
    for(k=2, N,
        Lp = L + y*R;
        Rp = (L+1) + y*(R-2);
        L = Lp;  R = Rp;
        y *= y;
    );
    return( L + y*R )
}


The symbolic representations of the polynomials L and R are shown in figure 38.8-A. The limit of the
sequence L is entry A034947 in [312]. It is a signed version of the paper-folding sequence. The sequence
after the initial zero is identical to the sequence of the $\left(\frac{-1}{n}\right)$ for n = 1, 2, 3, ... where $\left(\frac{a}{b}\right)$ denotes the
Kronecker symbol, see section 39.8 on page 781. Quick verification:

? for(n=1,88,print1(if(-1==kronecker(-1,n),"-","+")))
++-++--+++--+--+++-++---++--+--+++-++--+++--+---++-++---++--+--+++-++--+++--+--+++-++---
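
The same verification can be done without a number theory package. The following C++ sketch assumes the standard fact that the Kronecker symbol with numerator -1 depends only on the odd part of n (it is +1 if the odd part is 1 modulo 4, else -1), compares it with the differences of the sums of Gray code digits, and derives the leading bits of the paper-folding sequence from the signs. All function names are chosen for this illustration.

```cpp
#include <cassert>
#include <string>

// Kronecker symbol (-1/n) for n >= 1, computed from the odd part of n modulo 4.
inline int kronecker_minus_one(unsigned long n)
{
    while ( (n & 1) == 0 )  n >>= 1;
    return ( (n & 3) == 1 ? +1 : -1 );
}

// Sum of the binary digits of the Gray code of k.
inline int gray_digit_sum(unsigned long k)
{
    unsigned long g = k ^ (k >> 1);
    int s = 0;
    while ( g != 0 )  { ++s;  g &= (g-1); }
    return s;
}

// Bits 1, 2, ..., len of the paper-folding sequence: a one wherever the sign is +.
inline std::string paper_folding_prefix(unsigned long len)
{
    std::string s;
    for (unsigned long n=1; n<=len; ++n)
        s += ( kronecker_minus_one(n) > 0 ? '1' : '0' );
    return s;
}
```

The mapping of + to 1 and - to 0 corresponds to the relation P = (R + 1)/2 between the two constants.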
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The algorithm is divisionless and fast in practice, we compute R = E(1/2) to more than 600,000 decimal 


digits: 


? 
E 


? 


? 


N=21 ; \\ number of iterations 


? B-2^ 


N-1 \\ precision in bits 


2097151 


D=ceil(B * log(2)/log(10)) + 1 


631307 

default (realprecision,D) ; 
r=ge(1.0/2, N) 
0.701472376403734520735595521064 [... + > 600k digits] 


HA 
4K 


r-ge(1.0/2, N+2) 


last result computed in 2,726 ms. 


\\ checking precison 


-3.7441927084196955405267873193815837984 E-631306 


p=(1it+r)/2 AN paper folding constant 
0.850736188201867260367797760532 [... + > 600k digits] 


38.8.5 Weighted sum of Gray code digits 
Define H(y) by its power series H(y) = Dz, h(k)y* where h(k) is the weighted sum of digits of the 


,n),1/2^(*1),0)); 


\\ precision in decimal digits 


19/32 27/32 11/32 15/32 31/32 23/32 7/32 5/32 21/32 29/32 13/32 9/32 25/32 17/32 1/32 


Gray code: 
wgs(k)- 
local(g,t,s); s=0; g=gray(k); 
for (n=0,33, s+=if (bittest(g 
return(s) ; 
? for(k-1,33, printi(" ",wgs(k))) 
1/2 
3/4 1/4 
3/8 7/8 5/8 1/8 
aes 11/16 15/16 7/16 5/16 13/16 9/16 1/16 
3/32 
3/64 35/64 ... 


An iteration for the computation of H(y) is:

    F_1 = y/2,  B_1 = 1/2,  I_1 = (1+y)/2,  Y_1 = y    (38.8-20a)
    Y_n = Y_{n-1}^2    (38.8-20b)
    K_n = I_{n-1}/2    (38.8-20c)
    F_n = (F_{n-1}) + Y_n (B_{n-1} + K_n)  \to  H(y)    (38.8-20d)
    B_n = (F_{n-1} + K_n) + Y_n (B_{n-1})    (38.8-20e)
    I_n = K_n (1 + Y_n)    (38.8-20f)

Implementation in GP:

gw(y, N=11)=
{
    local(t, ii, F, B, Fp, Bp);
    /* correct up to order 2^N-1 */
    F=0+y;  B=1+0;  ii=1+y;
    ii /= 2;  F /= 2;  B /= 2;
    for(k=2, N,
        y *= y;
        ii /= 2;
        Fp = (F ) + y * (B + ii);
        Bp = (F + ii) + y * (B );
        F = Fp;  B = Bp;
        ii *= (1+y);
    );
    return( F )
}



We define the weighted sum of Gray code digits constant as H := H(1/2), then

    H        = 0.5337004886392849919588804814821242858549193225456118911...    (38.8-21)
    [base 2] = 0.1000100010100000100110000110000010010000101000000111100...
    [CF]     = [0, 1, 1, 6, 1, 11, 4, 5, 6, 1, 13, 1, 3, 1, 18, 5, 77, 1, 2, 2, 3, 1, 2, 1, 1, ...]

We have:

    H(1/q^2) = \frac{q}{q+1} H(1/q) + \frac{q}{q-1} H(-1/q)    (38.8-22a)
    H(1/q) = \frac{1}{2} [ \frac{q+1}{q} H(1/q^2) + \frac{q^2}{q^3 - q^2 + q - 1} ]    (38.8-22b)
    H(-1/q) = \frac{1}{2} [ \frac{q-1}{q} H(1/q^2) - \frac{q^2}{q^3 + q^2 + q + 1} ]    (38.8-22c)

38.9 A function encoding the Hilbert curve 


We define a function H(y) by the following iteration:

    H_1 = +i y + 1 y^2 - i y^3
    R_1 = +i y - 1 y^2 - i y^3
    Y_1 = y
    Y_{n+1} = Y_n^4
    H_{n+1} = -i R_n + Y_{n+1} (+i + H_n + Y_{n+1} (+1 + H_n + Y_{n+1} (-i + i R_n)))
    R_{n+1} = +i H_n + Y_{n+1} (+i + R_n + Y_{n+1} (-1 + R_n + Y_{n+1} (-i - i H_n)))

As the real and imaginary parts are swapped with each step we agree on iterating an even number of
times. The resulting function H(y) is

    H(y) = 0 + y + i y^2 - y^3 + i y^4 + i y^5 + y^6 - i y^7 + y^8 + i y^9 + y^{10} - i y^{11} - i y^{12} - y^{13} - ...    (38.9-2)


The coefficients of the series are ±1 and ±i, except for the constant term which is zero. If the sequence
of coefficients is interpreted as follows:

     0 := goto start  '0'
    +1 := move right  '>'
    -1 := move left   '<'
    +i := move up     '^'
    -i := move down   'v'

Then, symbolically:

    H = 0>^<^^>v>^>vv<v>>^>v>>^<^>^<<v<^^^>v>>^<^>^<<v<^<<v>vv<^<v<^^>^<...


Follow the signs to walk along the Hilbert curve, see figure 1.31-A on page 84. An implementation is

hh(y, N=4)=
{
    /* correct to order = 4^N-1 */
    local(H, R, tH, tR);
    H = +I*y + 1*y^2 - I*y^3;  R = +I*y - 1*y^2 - I*y^3;
    for(k=2, N,
        y = y^4;
        tH = -I*R + y*(+I + H + y*(+1 + H + y*(-I + I*R)));
        tR = +I*H + y*(+I + R + y*(-1 + R + y*(-I - I*H)));
        H = tH;  R = tR;
    );
    return( H );
}

The value of H(y) for y < 1 gives the limiting point in the complex plane when the walk according
to the coefficients is done with decreasing step lengths: step number k has step length y^k. The least
positive y where the real and imaginary part of the endpoint are equal is y_1 = 0.5436890126920....
It turns out that y_1 is the real root of the polynomial y^3 + y^2 + y - 1. One might suspect that
M(y) := Re H(y) - Im H(y), the difference between the real and the imaginary part of H(y), has the
factor y^3 + y^2 + y - 1. Indeed we have M(y) = y (y^3 + y^2 + y - 1) (y^{12} + y^8 + y^4 - 1) ··· and a similar
statement is true for P(y) := Re H(y) + Im H(y). We use this observation for the construction of a
simplified and quite elegant algorithm for the computation of H(y).
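
The iteration for H(y) can be replayed on plain coefficient lists. The Python sketch below is an illustration only: the function names are invented here, and the walk is checked by confirming that after four steps every cell of a 16 x 16 grid is visited exactly once.

```python
def hilbert_series(steps):
    """Coefficients of H (lowest order first) after `steps` steps of the iteration."""
    H = [0, 1j, 1, -1j]                # H_1 = +i y + y^2 - i y^3
    R = [0, 1j, -1, -1j]               # R_1 = +i y - y^2 - i y^3
    for _ in range(steps - 1):
        q = len(H)                     # the current Y equals y^q, q = 4, 16, 64, ...
        # tH = -i R + Y (i + H) + Y^2 (1 + H) + Y^3 (-i + i R), and similarly tR
        tH = [-1j * r for r in R] + H + H + [1j * r for r in R]
        tR = [1j * h for h in H] + R + R + [-1j * h for h in H]
        tH[q] += 1j
        tH[2 * q] += 1
        tH[3 * q] += -1j
        tR[q] += 1j
        tR[2 * q] += -1
        tR[3 * q] += -1j
        H, R = tH, tR
    return H


def walk_points(coeffs):
    """Points visited when the coefficients are taken as unit steps in the complex plane."""
    z, points = 0, [0]
    for c in coeffs[1:]:
        z += c
        points.append(z)
    return points


SYMBOLS = {1: ">", -1: "<", 1j: "^", -1j: "v", 0: "0"}
walk = "".join(SYMBOLS[c] for c in hilbert_series(2))
print(walk)
# 0>^<^^>v>^>vv<v>
```

The block structure of the lists (four copies of length q with three connecting moves) is exactly the nested form of the relations for H_{n+1} and R_{n+1}.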


38.9.1 A simplified algorithm 


Define the function P(y) as the result of the iteration

    Y_1 = y,  P_1 = 1    (38.9-3a)
    P_{n+1} = P_n (+1 + Y_n + Y_n^2 - Y_n^3)    (38.9-3b)
    Y_{n+1} = Y_n^4    (38.9-3c)
    P_n - 1  \to  P(y)    (38.9-3d)

and the function M(y) by

    Y_1 = y,  M_1 = 1    (38.9-4a)
    M_{n+1} = M_n (-1 + Y_n + Y_n^2 + Y_n^3)    (38.9-4b)
    Y_{n+1} = Y_n^4    (38.9-4c)
    y M_n  \to  M(y)    (38.9-4d)

Now the function H(y) can be computed as

    H(y) = \frac{1}{2} [ (P(y) + M(y)) + i (P(y) - M(y)) ]    (38.9-5)

The following implementations compute the series up to order 4^N - 1:

fpp(y, N=4)=
{
    local( t, Y );
    t = 1;  Y=y;
    for (k=1, N,  t *= (+1+Y+Y^2-Y^3);  Y=Y^4; );
    return( t-1 );
}

fmm(y, N=4)=
{
    local( t, Y );
    t = 1;  Y=y;
    for (k=1, N,  t *= (-1+Y+Y^2+Y^3);  Y=Y^4; );
    return( t*y-Y );  /* t*y without its top term y^(4^N) */
}

hhpm(y, N=4)=
{
    local( tp, tm );
    tp = fpp(y, N);
    tm = fmm(y, N);
    return( ((tp+tm) + I*(tp-tm))/2 );
}
With a routine tdir() that prints a power series with coefficients ∈ {-1, 0, +1} symbolically we obtain:

? N=4;
? tdir(fpp(y)); tdir(fmm(y));
0++-+++-+++----++++-+++-+++----++++-+++-+++----+---+---+---++++-++
0+----+++-+++-+++-++++---+---+----++++---+---+----++++---+---+----
? tdir((fpp(y)+fmm(y))/2); tdir((fpp(y)-fmm(y))/2);
0+0-00+0+0+00-0++0+0++0-0+0--0-000+0++0-0+0--0-0--0+00-0-0-00+0-00
00+0++0-0+0--0-00+0-00+0+0+00-0+++0-00+0+0+00-0+00-0--0+0-0++0+0++

Figure 38.9-A: Computation of the power series of the functions P (top) and M (bottom) with a string
substitution engine.


For n >= 1 the n-th coefficient of the power series of P(y) is determined by the parity of the number of
threes in the radix-4 representation of n: it is +1 if that number is even and -1 if it is odd. This can be
used for an efficient bit level algorithm, see section 1.31.1 on page 83.

The coefficients of the power series of the functions P and M can be computed with a string substitution
engine, see figure 38.9-A.
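
The simplified algorithm is easy to replay on coefficient lists. The following Python sketch expands the two products for N = 4, checks the radix-4 rule for P, and recombines P and M into the moves of the walk. The helper names are made up here, and dropping the top term of y M_n follows the GP routine `fmm()`.

```python
def product_series(factor, count):
    """Coefficients of prod_{k<count} f(y^(4^k)) where f(Y) = sum_j factor[j] * Y^j."""
    coeffs = [1]
    for _ in range(count):
        # multiply by f(Y) with Y = y^len(coeffs): four scaled copies, one after another
        coeffs = [f * c for f in factor for c in coeffs]
    return coeffs


def threes_sign(n):
    """+1 if the radix-4 representation of n has an even number of threes, else -1."""
    sign = 1
    while n:
        if n % 4 == 3:
            sign = -sign
        n //= 4
    return sign


N = 4
P = product_series((1, 1, 1, -1), N)
P[0] -= 1                                        # P_n - 1  -->  P(y)
M = [0] + product_series((-1, 1, 1, 1), N)[:-1]  # y * M_n without the top term y^(4^N)
MOVES = {(1, 0): ">", (-1, 0): "<", (0, 1): "^", (0, -1): "v", (0, 0): "0"}
walk = "".join(MOVES[(p + m) // 2, (p - m) // 2] for p, m in zip(P, M))
print(walk[:16])
# 0>^<^^>v>^>vv<v>
```

Here (P + M)/2 is the real part and (P - M)/2 the imaginary part of each coefficient of H, as in relation 38.9-5.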
38.9.2 The turns of the Hilbert curve 


We compute a function with series coefficients ∈ {-1, 0, +1} that correspond to the turns of the Hilbert
curve. We use +1 for a right turn, -1 for a left turn and zero for no turn. The sequence of turns starts
as
0--+0++--++0+--0-++-0--++--0-++00++-0--++--0-++-0--+0++--++0+--+ \
0++-0--++--0-++0+--+0++--++0+--00--+0++--++0+--+0++-0--++--0-++- \
-++-0--++--0-++0+--+0++--++0+--00--+0++--++0+--+0++-0--++--0-++0 \
+--+0++--++0+--0-++-0--++--0-++00++-0--++--0-++-0--+0++--++0+-- ...
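
The turns can also be derived directly from consecutive moves, which gives an independent check of the listing. The Python sketch below obtains the moves from the coefficients of P and M; the digit rule used for M (one sign change per nonzero radix-4 digit of n - 1) is an assumption read off from the product form M(y) = y (1 - y - y^2 - y^3)(1 - y^4 - y^8 - y^{12}) ···, and all names are illustrative.

```python
def digit_sign(n, flipping_digits):
    """Product of -1 over the radix-4 digits of n that lie in `flipping_digits`."""
    sign = 1
    while n:
        if n % 4 in flipping_digits:
            sign = -sign
        n //= 4
    return sign


def hilbert_move(n):
    """Step number n >= 1 of the walk as (dx, dy), from the coefficients of P and M."""
    p = digit_sign(n, {3})             # coefficient of y^n in P(y)
    m = digit_sign(n - 1, {1, 2, 3})   # coefficient of y^n in M(y)
    return (p + m) // 2, (p - m) // 2


def hilbert_turn(n):
    """+1 for a right turn, -1 for a left turn, 0 for no turn between steps n and n+1."""
    if n == 0:
        return 0
    ax, ay = hilbert_move(n)
    bx, by = hilbert_move(n + 1)
    return ay * bx - ax * by           # negated cross product, so right turns count as +1


print("".join("0+-"[hilbert_turn(n)] for n in range(32)))
# 0--+0++--++0+--0-++-0--++--0-++0
```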


The computation is slightly tricky:

hht(y, N=4)=
{
    /* correct to order = 4^N-1 */
    local( t, Y, F, s, p );
    t = 1;  Y=y;  p=1;
    F = y+y^2;
    for(k=2, N,
        Y = Y^4;
        t = -F + Y*F + Y^2*F - Y^3*F;
        p *= 4;
        if ( 0==(k%2),
            t += y^(1*p-1);
            t += y^(3*p);
            t -= (y+1)*y^(2*p-1);
        , /* else */
            t += y^(1*p);
            t += y^(3*p-1);
        );
        F = t;
    );
    if ( 1==N%2, F = -F );  \\ same result for even and odd N
    return( F );
}


We give modifications of the iteration for 1/(1 — y) where the power series are sparse. 


38.10.1 A fourth order iteration 


Define the function F(y) as the result of the iteration

    F_0 = 1,  Y_0 = y    (38.10-1a)
    F_{n+1} = F_n (1 + Y_n)  \to  F(y)    (38.10-1b)
    Y_{n+1} = Y_n^4    (38.10-1c)

We have

    F(y) = 1 + y + y^4 + y^5 + y^{16} + y^{17} + y^{20} + y^{21} + y^{64} + ...    (38.10-2)

The sequence of exponents is the Moser - De Bruijn sequence, entry A000695 in [312]. Let [t_0, t_1, t_2, ...]
be the continued fraction of F(1/q), then

to = 1 (38.10-3a) 
t = q-1 (38.10-3b) 
ij = 1 (38.10-3c) 
tg = -q (38.10-3d) 
tf = +q (38.10-3e) 
t = gd-grg-p (38.10-3f) 
te q +P HEHH (38.10-3g) 
tr = q2 q?! dm q? u ga ES q^ = q? ES q? git (38.10-3h) 
tg q + qt! + g38 + g3? + qe + GE Hg +g! (38.10-3i) 
447 qa + q" q? ge q i q q” 
For j > 4 we have 
t; l 
m = q +q’ = q! (q +1) where J=2* (38.10-4) 
j—2 
A functional equation for F is given by

    F(y) F(y^2) = \frac{1}{1-y}    (38.10-5)
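
The functional equation says that every n >= 0 can be written in exactly one way as a + 2b with a and b from the Moser - De Bruijn sequence. A minimal Python sketch of this check for the truncated product (the function name is illustrative):

```python
from collections import Counter


def moser_de_bruijn(factors):
    """Exponents in prod_{k<factors} (1 + y^(4^k)), in increasing order."""
    exps = [0]
    for k in range(factors):
        exps += [e + 4 ** k for e in exps]
    return exps


N = 5
S = moser_de_bruijn(N)
# coefficients of F(y) * F(y^2): number of representations n = a + 2*b with a, b in S
counts = Counter(a + 2 * b for a in S for b in S)
print(S[:10])
# [0, 1, 4, 5, 16, 17, 20, 21, 64, 65]
print(all(counts[n] == 1 for n in range(4 ** N)))
# True
```

For N factors the product of the two truncated series is exactly 1 + y + y^2 + ... + y^(4^N - 1).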


The relation is (mutatis mutandis) also true for the truncated product. The binary expansions of F (y), 
F(y?) and their product for y = 1/2 are (dots for zeros): 


Since the expansions are palindromes, their correlation is also a sequence of ones. The decimal expansions
of the corresponding constants A = F(1/2)/2 and B = F(1/4)/2 (so that A · B = 1/2) start as


A = 0.7968871593475341797306993394695199454864128606884632333...  (38.10-6a) 

B = 0.6274414063960875864722765982151031494159063845717264801...  (38.10-6b) 

Now define F_k := F(y^k). Then, by relations 38.10-5 and 38.1-16e on page 729 (for k = 2, 3, and 5), we
have

= FiF-2FFjF-MF, (38.10-7a) 

= F? F} — F; Fe (3 F? F} — 3 F, F + 1) (38.10-7b) 

= FPFj- Fio Fs (5 FÍ FS — 10 F? F3 +10 F? F} — 5 Fi Fa + 1) (38.10-7c) 


For power series over GF(2) relation 38.10-5 becomes F(y)^3 = 1/(1 - y). That is, F(y) = 1/\sqrt[3]{1 - y}.
In general, an iteration for the inverse (2^w - 1)-st root is obtained by replacing relation 38.10-1c with
Y_{k+1} = Y_k^e where e = 2^w.


38.10.2 A different fourth order iteration

The third order iteration for 1/(1+y) can be implemented as

inv3m(y, N=6)= /* third order --> 1/(1+y) */
{ /* correct to order 3^N */
    local(T);
    T = 1;
    for(k=1, N,
        T *= (1-y+y^2);
        y = y^3;
    );
    return( T );
}

To define the function F(y), we modify the routine to obtain a fourth-order iteration:

f43(y, N=6)=
{ /* correct to order 4^N */
    local(T, yt);
    T = 1;
    for(k=1, N,
        T *= (1-y+y^2);
        y = y^4;  /* Note fourth power */
    );
    return( T );
}

That is,

    F_0 = 1,  Y_0 = y    (38.10-8a)
    F_{n+1} = F_n (1 - Y_n + Y_n^2)  \to  F(y)    (38.10-8b)
    Y_{n+1} = Y_n^4    (38.10-8c)


The first few terms of the power series are

    F(y) = 1 - y + y^2 - y^4 + y^5 - y^6 + y^8 - y^9 + y^{10} - y^{16} + y^{17} - y^{18} + y^{20} - y^{21} + ...    (38.10-9)

Let [t_0, t_1, t_2, ...] be the continued fraction of F(1/q), then

    t_0 = 0    (38.10-10a)
    t_1 = 1    (38.10-10b)
    t_2 = q    (38.10-10c)
    t_3 = q    (38.10-10d)
    t_4 = q^2 - q    (38.10-10e)
    t_5 = q^4 + q^3 - q - 1    (38.10-10f)
te = @-qt+¢-qdt¢g-¢@ (38.10-10g) 
ty = go + g^? m d? + ga + ga _ g? _ q + q? q _ d (38.10-10h) 
For j > 6 we have 
ES = ge? E g^ + g^ + q? = g^ (q? ae 1) (q?7 = qi d 1) (qa + 1) (38.10-11) 


where J = 27-9, The terms of the continued fraction of F(1/q) for integer q grow doubly exponentially: 


? contfrac(f43(0.5))
[0, 1, 2, 2, 2, 21, 180, 92820, 3032435520, 26126907554432455680,
 240254294248527099500117907463345274880,
 164001256750215347067944129734442019102853751066678216639025390799096507269120,
 ...]

/* number of decimal digits of the terms in the CF: */ 
[-, 1, 1, 1, 1, 2, 3, 5, 10, 20, 39, 78, 154, 309, ... ] 
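
The first terms can be reproduced with exact rational arithmetic. The Python sketch below evaluates the truncated product at y = 1/2 and expands it into a regular continued fraction; the helper `continued_fraction` is written for this illustration. Six factors determine the value to roughly 4^6 bits, far more than the leading terms need.

```python
from fractions import Fraction


def f43(y, factors=6):
    """Truncated product prod (1 - Y + Y^2) with Y = y, y^4, y^16, ..."""
    t = Fraction(1)
    for _ in range(factors):
        t *= 1 - y + y * y
        y = y ** 4
    return t


def continued_fraction(x, terms):
    """The first `terms` terms of the regular continued fraction of the rational x."""
    result = []
    for _ in range(terms):
        a = x.numerator // x.denominator
        result.append(a)
        x -= a
        if x == 0:
            break
        x = 1 / x
    return result


cf = continued_fraction(f43(Fraction(1, 2)), 8)
print(cf)
print([len(str(t)) for t in cf[1:]])   # number of decimal digits of the terms
```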


By construction,

    F(y) = (1 - y + y^2) F(y^4)    (38.10-12a)
    F(-y) = (1 + y + y^2) F(y^4)    (38.10-12b)

The equivalent forms with y = 1/q are

    F(1/q) = \frac{q^2 - q + 1}{q^2} F(1/q^4)    (38.10-13a)
    F(-1/q) = \frac{q^2 + q + 1}{q^2} F(1/q^4)    (38.10-13b)

Now q^2 - q + 1 = p^2 - p + 1 if p = 1 - q, so

    F(1 - 1/q) = \frac{q^2 - q + 1}{q^2} F((1 - 1/q)^4)   where q > 1    (38.10-14a)
    F(1 + 1/q) = \frac{q^2 + q + 1}{q^2} F((1 + 1/q)^4)   where q < -1    (38.10-14b)

Adding relations α × (38.10-13a) and β × (38.10-14a) and simplifying gives

    \frac{α F(y) + β F(1 - y)}{α F(y^4) + β F((1 - y)^4)} = y^2 - y + 1   where α, β ∈ C    (38.10-15)

38.10.3 A sixth order iteration 
Define the function F(y) by the iteration

    F_0 = 1,  Y_0 = y    (38.10-16a)
    F_{n+1} = F_n (1 + Y_n + Y_n^2)  \to  F(y)    (38.10-16b)
    Y_{n+1} = Y_n^6    (38.10-16c)



Let [to,t1,t2,...] be the continued fraction of F(1/q), then 
t = 1 (38.10-17a) 
> = qui (38.10-17b) 
tp = q+1 (38.10-17c) 
ls = -q (38.10-17d) 
a = —— — M ee (38.10-17e) 
th = d-gq.g-g (38.10-17f) 
te = qt +g uq + q8 t qal o6 + gE go qeu (38.10-17g) 
428 + qY +g 4g + gH a qa qe go qu 
tp = qq) quoq qo gp gu qm (38.10-17h) 
ig = PHTH... HaT Hg (38.10-171) 
For j > 4 we have 
by o [aq -7? if j odd 
Do { (q197 + q97 + q87 A 497 +477) /(q7 +1) otherwise Oe TO] 
where J = 6/74, 
38.11 An iteration related to the Fibonacci numbers 
Start: 0
Rules:
  0 --> 1
  1 --> 10
A0 = 0
A1 = 1
A2 = 10
A3 = 101
A4 = 10110
A5 = 10110101
A6 = 1011010110110
A7 = 101101011011010110101
A --> 101101011011010110101101101011011010110101...
Figure 38.11-A: String substitution to compute the rabbit constant.

The rabbit constant is

    A        = 0.7098034428612913146417873994445755970125022057678605169...    (38.11-1)
    [base 2] = 0.1011010110110101101011011010110110101101011011010110101...
    [CF]     = [0, 1, 2, 2, 4, 8, 32, 256, 8192, 2097152, 17179869184,
                36028797018963968, 618970019642690137449562112, ...]
             = [0, 2^0, 2^1, 2^1, 2^2, 2^3, 2^5, 2^8, 2^{13}, ..., 2^{F_n}, ...]


The sequence of zeros and ones after the decimal point in the binary expansion is referred to as rabbit
sequence or infinite Fibonacci word, an entry in [312]; the sequence of decimal digits is a separate entry.
The rabbit sequence can be computed by starting with a single zero and repeated application
of the following substitution rules: simultaneously replace all zeros by one (0 -> 1, 'young rabbit gets
old') and all ones by one-zero (1 -> 10, 'old rabbit gets child'). No sex, no death. The evolution is shown
in figure 38.11-A.

The crucial observation is that each element A_n can be obtained by appending A_{n-2} to A_{n-1}, that
is A_n = A_{n-1}.A_{n-2}. To compute the value of the rabbit constant in base 2 to N bits precision, the
whole process requires only copying N bits of data, which is the minimal conceivable work for a (non-sparse)
computation.
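
Both descriptions of the sequence, substitution and concatenation, are short enough to compare directly. A minimal Python sketch with illustrative function names:

```python
def rabbit_words(n):
    """A_0 .. A_n via the concatenation A_k = A_{k-1} . A_{k-2}."""
    words = ["0", "1"]
    while len(words) <= n:
        words.append(words[-1] + words[-2])
    return words[: n + 1]


def substitute(word):
    """One round of the rules 0 -> 1 and 1 -> 10."""
    return "".join("1" if c == "0" else "10" for c in word)


words = rabbit_words(20)
bits = words[20]
value = int(bits, 2) / 2 ** len(bits)  # the word read as a binary fraction 0.1011...
print(words[7])
# 101101011011010110101
print(value)
```

The concatenation only copies existing bits, which is the point made above about the minimal amount of work.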


We define a function A(y) that has the special value

    A = \frac{1}{2} A(\frac{1}{2})    (38.11-2)

by the equivalent operation for power series. The function can be computed by the following iteration:

    L_0 = 0,  R_0 = 1,  l_0 = 1,  r_0 = y    (38.11-3a)
    l_{n+1} = r_n = y^{F_{n+1}}    (38.11-3b)
    r_{n+1} = r_n l_n = y^{F_{n+2}}    (38.11-3c)
    L_{n+1} = R_n    (38.11-3d)
    R_{n+1} = R_n + r_{n+1} L_n = R_n + y^{F_{n+2}} L_n = R_n + y^{F_{n+2}} R_{n-1}  \to  A(y)    (38.11-3e)


Here F_n denotes the n-th Fibonacci number (sequence A000045 in [312]):

    n    0  1  2  3  4  5  6   7   8   9  10  11   12   13   14   15  ...
    F_n  0  1  1  2  3  5  8  13  21  34  55  89  144  233  377  610  ...


A GP implementation of the iteration is 


fa(y, N=10)=
{
    local(t, yl, yr, L, R, Lp, Rp);
    /* correct up to order fib(N+2)-1 */
    L=0;  R=1;  yl=1;  yr=y;
    for(k=1, N,
        t=yr;  yr*=yl;  yl=t;
        Lp=R;  Rp=R+yr*L;  L=Lp;  R=Rp;
    );
    return( R )
}


After the n-th step the series in y is correct up to order F_{n+2} - 1. That is, the order of convergence
equals (\sqrt{5} + 1)/2 ≈ 1.6180. The function A(y) has the power series

    A(y) = 1 + y^2 + y^3 + y^5 + y^7 + y^8 + y^{10} + y^{11} + y^{13} + y^{15} + y^{16} + y^{18} + y^{20} + ...    (38.11-4)

The sequence of exponents of y in the series is entry A022342 in [312], the Fibonacci-even numbers. The
Fibonacci-odd numbers are entry A003622.

The following continued fraction for (1 - 1/q) A(1/q) is from [166, p.294]:

    [0, 1, q, q, q^2, q^3, q^5, q^8, q^{13}, ..., q^{F_n}, ...]    (38.11-5)


38.11.1 Fibonacci representation 


The greedy algorithm to compute the Fibonacci representation (or Zeckendorf representation) of an
integer repeatedly subtracts the largest Fibonacci number that is less than or equal to it until the
number is zero. The Fibonacci representations of the numbers 0...80 are shown in figure 38.11-B.

The sequence of lowest Fibonacci bits (entry A003849 in [312]) is
0,1,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0, ...

The string, interpreted as the binary number x = 0.1001010010010..._2, gives the decimal constant x =
0.5803931.... It turns out that A = 1 - x/2 (that is, x = 2 - A(1/2)).
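
A minimal Python sketch of the greedy algorithm, using the parts 1, 2, 3, 5, 8, ... (the function name and the digit order are choices made for this illustration). It also evaluates the binary number x from the lowest bits and the relation A = 1 - x/2 numerically.

```python
def zeckendorf(n):
    """Fibonacci (Zeckendorf) digits of n, most significant first, for the parts 1, 2, 3, 5, ..."""
    parts = [1, 2]
    while parts[-1] <= n:
        parts.append(parts[-1] + parts[-2])
    digits = []
    for f in reversed(parts):
        if f <= n:                     # greedy: take the largest part that still fits
            digits.append(1)
            n -= f
        else:
            digits.append(0)
    return digits


lowest = [zeckendorf(n)[-1] for n in range(60)]
x = sum(bit / 2 ** n for n, bit in enumerate(lowest))
print(lowest[:16])
# [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
print(x, 1 - x / 2)
```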


The sequence of the numbers of ones in the Fibonacci representations (second lowest row in figure 38.11-B)
is entry A007895 in [312]. This sequence modulo 2 gives the Fibonacci parity (entry A095076). It can be
computed by initializing L_0 = 1 and changing relation 38.11-3e to

    R_{n+1} = R_n - r_{n+1} L_n = R_n - y^{F_{n+2}} L_n  \to  A_p(y)    (38.11-6)


Figure 38.11-B: Fibonacci representations of the numbers 0...80. A dot is used for zero. The two
lower lines are the sum of digits and the sum of digits modulo 2, the Fibonacci parity.


Let A_p(y) be the corresponding function. We define the Fibonacci parity constant A_p as

    A_p      = 1 - A_p(1/2)/2    (38.11-7a)
             = 0.9105334708635617638046868867710980073445812290069376454...    (38.11-7b)
    [base 2] = 0.11101001000110001011100010110111010001011011100111010010...
    [CF]     = [0, 1, 10, 5, 1, 1, 1, 3, 4, 2, 6, 25, 4, 5, 1, 1, 3, 5, 1, 3, 2, 1, 1, 1, 3, 1, 3, 22, 1,
                10, 1, 2, 3, 2, 73, 1, 111, 46, 1, 51, 2, 1, 1, 5, 1, 65, 3, 1, 3, 2, 5, 6, 1, 4, 1, 2, ...]


The sequence of the Fibonacci representations interpreted as binary numbers is

    0, 1, 2, 4, 5, 8, 9, 10, 16, 17, 18, 20, 21, 32, 33, 34, 36, 37, 40,
    41, 42, 64, 65, 66, 68, 69, 72, 73, 74, 80, 81, 82, 84, 85, 128, 129, ...

This is entry A003714 in [312], where the numbers are called Fibbinary numbers. Define F_2(y) to be the
function with the same sequence of power series coefficients:

    F_2(y) = 0 + 1y + 2y^2 + 4y^3 + 5y^4 + 8y^5 + 9y^6 + 10y^7 + 16y^8 + 17y^9 + 18y^{10} + ...    (38.11-8)

A slightly more general function F_b(y) (which for b = 2 gives the power series above) can be computed
by the iteration

    L_0 = 0,  R_0 = y,  l_0 = y,  r_0 = y    (38.11-9a)
    A_0 = 1,  B_0 = 1,  typically b = 2    (38.11-9b)
    A_{n+1} = b B_n    (38.11-9c)
    B_{n+1} = b [B_n + r_n A_n]    (38.11-9d)
    l_{n+1} = r_n = y^{F_{n+2}}    (38.11-9e)
    r_{n+1} = r_n l_n = y^{F_{n+3}}    (38.11-9f)
    L_{n+1} = R_n    (38.11-9g)
    R_{n+1} = R_n + r_{n+1} [L_n + A_{n+1}]  \to  F_b(y)    (38.11-9h)


A GP implementation is 


ffb(y, b=2, N=13)=
{ /* correct up to order fib(N+3)-1 */
    local(t, yl, yr, L, R, Lp, Rp, Ri, Li);
    L=0;  R=0+1*y;
    Li=1;  Ri=1;
    yl=y;  yr=y;
    for(k=1, N,
        Li*=b;  Ri*=b;
        Lp=Ri;  Rp=Ri+yr*Li;  Li=Lp;  Ri=Rp;
        t=yr;  yr*=yl;  yl=t;
        Lp=R;  Rp=R+yr*(L+Li);  L=Lp;  R=Rp;
    );
    return( R )
}

Let B(x) be the function with power series coefficients equal to 1 if the exponent is a Fibbinary number
and zero else:

    B(x) := 1 + x + x^2 + x^4 + x^5 + x^8 + x^9 + x^{10} + x^{16} + x^{17} + x^{18} + x^{20} + x^{21} + ...    (38.11-10)

Then a functional equation for B(x) is (see entry A003714 in [312])

    B(x) = x B(x^4) + B(x^2)    (38.11-11)

We turn the relation into a recursion for the computation of B(x) correct up to the term x^N:

fibbi(z, R)=
{
    if ( z+R==0, return( 1+R ) );
    return( z*fibbi(z^4,R) + fibbi(z^2,R) );
}
5 


We check the functional equation:

? N=30;  R=O(x^(N+1));  \\ R is used to truncate terms of order >N
? t=fibbi(x,R)
1 + x + x^2 + x^4 + x^5 + x^8 + x^9 + x^10 + x^16 + x^17 + x^18 + x^20 + x^21 + O(x^31)
? t2=fibbi(x^2,R)
1 + x^2 + x^4 + x^8 + x^10 + x^16 + x^18 + x^20 + O(x^31)
? t4=x*fibbi(x^4,R)
x + x^5 + x^9 + x^17 + x^21 + O(x^32)
? t-(t4+t2)
O(x^31)
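
The same check can be done coefficient by coefficient: the coefficient of x^n in x B(x^4) is that of B at (n-1)/4 when n ≡ 1 mod 4, and the coefficient of x^n in B(x^2) is that of B at n/2 when n is even. A small Python sketch, using the bit test `n & (n >> 1) == 0` for the Fibbinary numbers (no two adjacent ones):

```python
def is_fibbinary(n):
    """True if the binary representation of n has no two adjacent ones."""
    return n & (n >> 1) == 0


N = 1000
B = [1 if is_fibbinary(n) else 0 for n in range(N + 1)]   # coefficients of B(x)
rhs = [
    (B[(n - 1) // 4] if n % 4 == 1 else 0)    # coefficient of x^n in x * B(x^4)
    + (B[n // 2] if n % 2 == 0 else 0)        # coefficient of x^n in B(x^2)
    for n in range(N + 1)
]
print(B == rhs)
# True
print([n for n in range(22) if B[n]])
# [0, 1, 2, 4, 5, 8, 9, 10, 16, 17, 18, 20, 21]
```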


38.11.2 Digit extract algorithms for the rabbit constant 


The spectrum of a real number x is the sequence of integers ⌊k·x⌋ where k ∈ N_+ (the sequence ⌊k·x⌋
where x is irrational is called a Beatty sequence). The spectrum of the golden ratio g = (\sqrt{5} + 1)/2 ≈
1.61803 gives the exponents of y where the series for y A(y) has coefficient one:

bt(x, n=25)=
{
    local(v);
    v = vector(n);
    for (k=1, n, v[k]=floor(x*k));
    return ( v );
}

? g=(sqrt(5)+1)/2
1.618033988749894848204586834365638117720309179805762862
? bt(g, n)
[1, 3, 4, 6, 8, 9, 11, 12, 14, 16, 17, 19, 21, 22, 24, 25, 27, 29,
 30, 32, 33, 35, 37, 38, 40, 42, 43, 45, 46, 48, 50, 51, 53, 55, 56,
 58, 59, 61, 63, 64]
? t=taylor(y*fa(y),y)
y + y^3 + y^4 + y^6 + y^8 + y^9 + y^11 + y^12 + y^14 + y^16 + y^17 +
y^19 + y^21 + y^22 + y^24 + y^25 + y^27 + y^29 + y^30 + y^32 + y^33 +
y^35 + y^37 + y^38 + y^40 + y^42 + y^43 + y^45 + y^46 + y^48 + y^50 +
y^51 + y^53 + y^55 + y^56 + y^58 + y^59 + y^61 + y^63 + y^64 + O(y^66)

The sequence [1, 3, 4, 6, ...] of exponents where the coefficient equals 1 is sequence A000201 in [312].
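
The agreement between the spectrum and the exponents can be checked without floating point numbers. In the Python sketch below the polynomials of iteration 38.11-3 are stored as lists of exponents (all coefficients are 0 or 1), and ⌊k·g⌋ is evaluated exactly as (k + ⌊\sqrt{5 k^2}⌋)/2 rounded down; both function names are illustrative.

```python
from math import isqrt


def fa_exponents(steps):
    """Exponents of the terms of R_n after `steps` steps of iteration 38.11-3."""
    L, R = [], [0]                     # L_0 = 0, R_0 = 1
    el, er = 0, 1                      # l_0 = y^0, r_0 = y^1
    for _ in range(steps):
        el, er = er, er + el           # l <- r, r <- r * l
        L, R = R, R + [er + e for e in L]
    return R


def golden_spectrum(count):
    """floor(k * g) for k = 1..count with g = (sqrt(5) + 1) / 2, in exact integer arithmetic."""
    return [(k + isqrt(5 * k * k)) // 2 for k in range(1, count + 1)]


spectrum = golden_spectrum(40)
series = [e + 1 for e in fa_exponents(10) if e + 1 <= 64]   # exponents of y * A(y)
print(spectrum[:10])
# [1, 3, 4, 6, 8, 9, 11, 12, 14, 16]
print(series == spectrum)
# True
```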


There is a digit extract algorithm for the binary expansion of the rabbit constant. We use a binary search
algorithm:

bts(x, k)=
{ /* return 0 if k is not in the spectrum of x, else return index >=1 */
    local(nlo, nhi, t);
    if ( 0==k, return(0) );
    t = 1 + ceil(k/x);  \\ floor(t*x)>=k
    nlo = 1;  nhi = t;
    while ( nlo!=nhi,
        t = floor( (nlo+nhi)/2 );
        if ( floor(t*x) < k, nlo=t+1, nhi=t);
    );
    if ( floor(nhi*x) == k, return(nhi), return(0));
}

? g=(sqrt(5)+1)/2
? for(k=1,65,if(bts(g,k),print1("1"),print1("0")));print();
10110101101101011010110110101101101011010110110101101011011010110

The algorithm is very fast: we compute 1000 bits starting from position 1,000,000,000,000:

? g=(sqrt(5)+1)/2
? dd=10^12;  /* digits starting at position dd... */
? for(k=dd,dd+1000,if(bts(g,k),print1("1"),print1("0")));print();
--snip--
  ***   last result computed in 236 ms.


An even faster method for computing individual bits of the sequence proceeds by subtracting the Fibonacci
numbers > 1 until zero or one is reached. This gives the complement of the rabbit sequence:

fpn=999;
vpv=vector(fpn, j, fibonacci(j+2));  /* vpv=[2,3,5,8,...] */
t=vpv[length(vpv)];  /* log(t)/log(10)==208.8471... OK for range up to 5*10^200 (!) */

flb(x)=
{ /* return the lowest bit of the Fibonacci representation */
    local(k, t);
    k=bsearchgeq(x, vpv);
    while ( k>0,
        t = vpv[k];
        if (x>=t, x-=t);
        k-- );
    return ( x );
}

? dd=0;
? for(k=dd,dd+40,t=flb(k);print1(1-t))
10110101101101011010110110101101101011010
/* 0.10110101101101011010110110101101101011010 rabbit constant */
The routine bsearchgeq() does a binary search (see section 3.2 on page 141) for the first element that
is greater than or equal to the element sought:

bsearchgeq(x, v)=
{ /* return index of first element in v[] that is >=x, return 0 if x>max(v[]) */
    local(nlo, nhi, t);
    nlo = 1;  nhi = length(v);
    while ( nlo!=nhi,
        t = floor( (nlo+nhi)/2 );
        if ( v[t] < x, nlo=t+1, nhi=t);
    );
    if ( v[nhi] >= x, return(nhi), return(0));
}

We compute the first 1000 bits starting from position 10^100:

? dd=10^100-1;
? for(k=dd, dd+1000, t=flb(k); print1(1-t))
--snip--
  ***   last result computed in 1,305 ms.
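
A minimal Python sketch of the same digit extraction, assuming the table of Fibonacci numbers 2, 3, 5, 8, ... with 999 entries as above. The binary search is delegated to `bisect_left` from the standard library, and the names `FIBS`, `lowest_fibonacci_bit` and `rabbit_bit` are made up for this illustration.

```python
from bisect import bisect_left

FIBS = [2, 3]
while len(FIBS) < 999:                 # 2, 3, 5, 8, ...; the largest entry has 209 decimal digits
    FIBS.append(FIBS[-1] + FIBS[-2])


def lowest_fibonacci_bit(x):
    """Lowest bit of the Fibonacci representation of x, for 0 <= x <= FIBS[-1]."""
    k = bisect_left(FIBS, x)           # index of the first entry >= x
    for t in reversed(FIBS[: k + 1]):
        if x >= t:
            x -= t
    return x


def rabbit_bit(k):
    """Bit number k (starting with k = 0) of the rabbit sequence."""
    return 1 - lowest_fibonacci_bit(k)


print("".join(str(rabbit_bit(k)) for k in range(41)))
# 10110101101101011010110110101101101011010
print("".join(str(rabbit_bit(10 ** 100 - 1 + k)) for k in range(20)))
```

The cost per bit is one binary search plus at most one pass over the table, independent of how many earlier bits exist.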


38.12 Iterations related to the Pell numbers 


Figure 38.12-A: Evolution for the string substitution rules 0 -> 1 and 1 -> 110.

We use the string substitution shown in figure 38.12-A. The length of the n-th string is

    p_n = 1, 1, 3, 7, 17, 41, 99, 239, ...,    p_k = 2 p_{k-1} + p_{k-2}

This sequence is entry A001333 in [312], the numerators of the convergents of the continued fraction of \sqrt{2}.
The Pell numbers are the first differences (and the denominators of the convergents of \sqrt{2}), sequence A000129:

    0, 1, 2, 5, 12, 29, 70, 169, 408, 985, 2378, 5741, ...

Observe that B_n = B_{n-1}.B_{n-1}.B_{n-2}. Define the function B(y) by the iteration


    L_0 = 1,  R_0 = 1 + y,  l_0 = y,  r_0 = y    (38.12-1a)
    l_{n+1} = r_n    (38.12-1b)
    r_{n+1} = r_n^2 l_n    (38.12-1c)
    L_{n+1} = R_n    (38.12-1d)
    R_{n+1} = R_n + r_{n+1} R_n + r_{n+1}^2 L_n  \to  B(y)    (38.12-1e)

After the n-th step the series in y is correct up to order p_n. That is, the order of convergence is \sqrt{2} + 1
≈ 2.4142. We implement the function B(y) in GP:

fb(y, N=8)=
{
    local(t, yr, yl, L, R, Lp, Rp);
    L=1;  R=1+y;  yl=y;  yr=y;
    for(k=1, N,
        t=yr;  yr*=yr*yl;  yl=t;
        Lp=R;  Rp=R+yr*R+yr^2*L;  L=Lp;  R=Rp;
    );
    return( R )
}
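
The string substitution, the concatenation rule and the iteration describe the same objects. The following Python sketch compares them, storing R_n as a list of exponents since all coefficients are 0 or 1; the function names are illustrative.

```python
def pell_words(n):
    """B_0 .. B_n via the concatenation B_k = B_{k-1} . B_{k-1} . B_{k-2}."""
    words = ["0", "1"]
    while len(words) <= n:
        words.append(words[-1] + words[-1] + words[-2])
    return words[: n + 1]


def substitute(word):
    """One round of the rules 0 -> 1 and 1 -> 110."""
    return "".join("1" if c == "0" else "110" for c in word)


def fb_exponents(steps):
    """Exponents of the terms of R_n after `steps` steps of iteration 38.12-1."""
    L, R = [0], [0, 1]                 # L_0 = 1, R_0 = 1 + y
    el, er = 1, 1                      # l_0 = y, r_0 = y
    for _ in range(steps):
        el, er = er, 2 * er + el       # l <- r, r <- r^2 * l
        L, R = R, R + [er + e for e in R] + [2 * er + e for e in L]
    return R


words = pell_words(6)
print(words[4])
# 11011011101101110
print(fb_exponents(2) == [i for i, c in enumerate(words[4]) if c == "1"])
# True
```

After n steps the exponents of R_n are the positions of the ones in the word B_{n+2}.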
We obtain the series

    B(y) = 1 + y + y^3 + y^4 + y^6 + y^7 + y^8 + y^{10} + y^{11} + y^{13} + y^{14} + y^{15} + y^{17} + ...    (38.12-2)

Define the Pell constant B as B = \frac{1}{2} B(\frac{1}{2}), then

    B        = 0.8582676564610020557922603084333751486649051900835067786...    (38.12-3)
    [base 2] = 0.11011011101101110110110111011011101101101110110111011011...
    [CF]     = [0, 1, 6, 18, 1032, 16777344, 288230376151842816,
                1393796574908163946345982392042721617379328, ...]

The sequence of zeros and ones in the binary expansion is entry A080764 in [312]. For the terms of the
continued fraction we note

    6 = 2^2 + 2^1    (38.12-4a)
    18 = 2^4 + 2^1    (38.12-4b)
    1032 = 2^{10} + 2^3    (38.12-4c)
    16777344 = 2^{24} + 2^7    (38.12-4d)
    288230376151842816 = 2^{58} + 2^{17}    (38.12-4e)
    1393796574908163946345982392042721617379328 = 2^{140} + 2^{41}    (38.12-4f)

38.12.1 Pell palindromes

Define the function P(y) by

    L_0 = 1,  R_0 = 1 + y^2,  l_0 = y,  r_0 = y    (38.12-5a)
    l_{n+1} = r_n    (38.12-5b)
    r_{n+1} = r_n^2 l_n    (38.12-5c)
    L_{n+1} = R_n    (38.12-5d)
    R_{n+1} = R_n + r_{n+1} L_n + r_{n+1} l_{n+1} R_n  \to  P(y)    (38.12-5e)

Note that R_0 is a palindrome and in relation 38.12-5e the combination of the parts gives a palindrome.
For R_0 = 1 + y + y^2 the iteration computes 1/(1 - y).

Define the Pell palindromic constant as P = P(1/2)/2, then

    P        = 0.7321604330635328371645901871773044657272986589604112390...    (38.12-6)
    [base 2] = 0.1011101101101110110111011011101101101110110111011011011...
    [CF]     = [0, 1, 2, 1, 2, 1, 3, 17, 1, 7, 2063, 1, 63, 268437503, 1, 8191, 590295810358974087167,
                1, 1073741823, 374144419156711147060143317175958748842277436653567, 1, ...]

By construction, the binary expansion is a palindrome up to lengths 1, 3, 7, 17, 41, 99, 239, ....


Start: 0
Rules:
  0 --> 1
  1 --> 101
P0 = 0
P1 = 1
P2 = 101
P3 = 1011101
P4 = 10111011011011101
P5 = 10111011011011101101110110111011011011101
P --> 1011101101101110110111011011101101101110110111011011011101101...

Figure 38.12-B: String substitution for the Pell palindromic constant.

The sequence of zeros and ones in the binary expansion of P is entry A104521 in [312]. It can be computed
by the replacement rules 0 -> 1 and 1 -> 101 shown in figure 38.12-B.


38.12.2 Pell representation 


Figure 38.12-C: Pell representations of the numbers 0...80 (dots for zeros). The three lower lines are
the sum of digits and the sum of digits modulo 3 and 2.

To compute the Pell representation of a given number n, set t = n and repeatedly subtract from t the largest
number p_k ∈ {1, 3, 7, 17, 41, 99, 239, ...} that is not greater than t. Stop when t = 0. The number of
times that p_k has been subtracted gives the k-th digit of the representation. The resulting digits are 0,
1, or 2. If the k-th digit equals 2, then the (k - 1)-th digit will be zero.

The power series of the function S(y) has the sums of Pell digits as coefficients:

    S(y) = 0 + 1y + 2y^2 + 1y^3 + 2y^4 + 3y^5 + 2y^6 + 1y^7 + 2y^8 + 3y^9 + 2y^{10} + 3y^{11} + 4y^{12} + ...    (38.12-7)

It can be computed via the iteration (see section 38.7 on page 739)

    L_0 = 0,  R_0 = 0 + y + 2y^2,  l_0 = y,  r_0 = y    (38.12-8a)
    A_0 = 1,  B_0 = 1 + y + y^2    (38.12-8b)
    l_{n+1} = r_n    (38.12-8c)
    r_{n+1} = r_n^2 l_n    (38.12-8d)
    L_{n+1} = R_n    (38.12-8e)
    R_{n+1} = R_n + r_{n+1} (R_n + B_n) + r_{n+1}^2 (L_n + 2 A_n)  \to  S(y)    (38.12-8f)
    A_{n+1} = B_n    (38.12-8g)
    B_{n+1} = B_n + r_{n+1} B_n + r_{n+1}^2 A_n    (38.12-8h)

Implementation in GP:

fs(y, N=8)=
{
    local(t, yr, yl, L, R, Lp, Rp, Li, Ri);
    L = 0;  R = 0+y+2*y^2;
    Li = 1;  Ri = 1+y+y^2;
    yl = y;  yr = y;
    for(k=1, N,
        t=yr;  yr*=yr*yl;  yl=t;
        Lp=R;  Rp=R+yr*(R+Ri)+yr^2*(L+2*Li);  L=Lp;  R=Rp;
        Lp=Ri;  Rp=Ri+yr*Ri+yr^2*Li;  Li=Lp;  Ri=Rp;
    );
    return( R )
}

The series coefficients grow slowly, so the first few of them can nicely be displayed as

    S(\frac{1}{10}) = 0.1212321232343234123234323434543452343454123234323434543...    (38.12-9)
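
The greedy computation of the Pell representation can be sketched in a few lines of Python. Taking the quotient with each part in decreasing order is the same as subtracting the largest possible part repeatedly; the function name and the digit order (most significant first) are choices made for this illustration.

```python
def pell_digits(n):
    """Pell representation of n, most significant digit first, for the parts 1, 3, 7, 17, 41, ..."""
    parts = [1, 3]
    while parts[-1] <= n:
        parts.append(2 * parts[-1] + parts[-2])
    digits = []
    for p in reversed(parts):
        d, n = divmod(n, p)            # d = number of times p can be subtracted
        digits.append(d)
    return digits


sums = [sum(pell_digits(n)) for n in range(56)]
print("".join(str(s) for s in sums[1:]))
# 1212321232343234123234323434543452343454123234323434543
```

The printed digit sums are the coefficients of S(y) and agree with the decimal digits of S(1/10) shown above.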


38.12.3 Pell Gray code 


Figure 38.12-D: A Gray code for Pell representations. A dot is used for zero. The three following
lines are the sum of digits and the sum of digits modulo 3 and 2. The sequence is
0, 1, 2, 5, 4, 3, 6, 13, 10, 11, 12, 9, 8, 7, 14, ..., the difference between successive elements is a Pell number.
The lowest block gives the Pell representations of the (absolute) differences, the Pell ruler function.

Figure 38.12-E: Construction for a Gray code for Pell representations.

Figure 38.12-D gives a Gray code for the Pell representations (see also section 14.6.1 on page 313). The
Gray code can be constructed recursively as shown in figure 38.12-E. In the algorithm each block is split
into a left and a right part. The next block is created by appending to the current block its reverse with
ones on top and appending the left part with twos on top. The iteration can actually be started with a
block of a single zero (the left part being also a single zero), as in the following algorithm:


    F_0 = 0,  F'_0 = 0,  B_0 = 0,  B'_0 = 0    (38.12-10a)
    I_0 = 1,  I'_0 = 1,  Y_0 = y,  Y'_0 = y    (38.12-10b)
    b_n = 4^n,  c_n = 2 b_n    (38.12-10c)
    F_{n+1} = (F_n) + Y_n (B_n + b_n I_n) + Y_n^2 (F'_n + c_n I'_n)  \to  G_b(y)    (38.12-10d)
    B_{n+1} = (B'_n + c_n I'_n) + Y'_n (F_n + b_n I_n) + Y'_n Y_n (B_n)    (38.12-10e)
    I_{n+1} = I_n + Y_n I_n + Y_n^2 I'_n    (38.12-10f)
    Y_{n+1} = Y_n^2 Y'_n    (38.12-10g)
    F'_{n+1} = F_n,  B'_{n+1} = B_n    (38.12-10h)
    I'_{n+1} = I_n,  Y'_{n+1} = Y_n    (38.12-10i)
----- k = 0
  yl = (y)   yr = (y)
  iir = 1
  iil = 1
  F = 0
  B = 0
----- k = 1   b = 1
  yl = (y)   yr = (y^3)
  iir = (y^2 + y + 1)
  iil = 1
  F = (2 y^2 + y)
  B = (y + 2)
----- k = 2   b = 4
  yl = (y^3)   yr = (y^7)
  iir = (y^6 + y^5 + y^4 + y^3 + y^2 + y + 1)
  iil = (y^2 + y + 1)
  F = (8 y^6 + 4 y^5 + 5 y^4 + 6 y^3 + 2 y^2 + y)
  B = (y^5 + 2 y^4 + 6 y^3 + 5 y^2 + 4 y + 8)
----- k = 3   b = 16
  yl = (y^7)   yr = (y^17)
  iir = (y^16 + y^15 + y^14 + y^13 + y^12 + y^11 + y^10 + y^9
       + y^8 + y^7 + y^6 + y^5 + y^4 + y^3 + y^2 + y + 1)
  iil = (y^6 + y^5 + y^4 + y^3 + y^2 + y + 1)
  F = (34 y^16 + 33 y^15 + 32 y^14
     + 16 y^13 + 17 y^12 + 18 y^11 + 22 y^10 + 21 y^9 + 20 y^8 + 24 y^7
     + 8 y^6 + 4 y^5 + 5 y^4 + 6 y^3 + 2 y^2 + y)
  B = (y^15 + 2 y^14 + 6 y^13 + 5 y^12 + 4 y^11 + 8 y^10
     + 24 y^9 + 20 y^8 + 21 y^7 + 22 y^6 + 18 y^5 + 17 y^4 + 16 y^3
     + 32 y^2 + 33 y + 34)

Figure 38.12-F: Quantities with the computation of the series related to the Pell Gray code.

Implementation in GP:

pgr(y, N=11)= /* Pell Gray code */
{
    local(iir, iil, yl, yr, Fl, F, Bl, B, b, c);
    local(t, tf, tb);
    /* correct up to order pell(N+1)-1 */
    F=0;  Fl=0;  B=0;  Bl=0;
    iil=1;  iir=1;  yl=y;  yr=y;
    for(k=1, N,
        b = 4^(k-1);  c = 2*b;  /* b = pell(k); */
        tf = (F ) + yr * (B + b*iir) + yr*yr * (Fl + c*iil);
        tb = (Bl + c*iil) + yl * (F + b*iir) + yl*yr * (B );
        Fl = F;  Bl = B;
        F = tf;  B = tb;
        t = iir;  iir += yr*(iir + yr*iil);  iil = t;
        t = yr;  yr *= (yr*yl);  yl = t;
    );
    return( F )
}

It is instructive to look at the variables in the first few steps of the iteration, see figure 38.12-F.
The power series for $G_b(y)$ is

$$\begin{aligned}
G_b(y) = {} & 0 + 1y + 2y^2 + 6y^3 + 5y^4 + 4y^5 + 8y^6 + 24y^7 + 20y^8 + 21y^9 + 22y^{10} + 18y^{11} + {} \\
 & 17y^{12} + 16y^{13} + 32y^{14} + 33y^{15} + 34y^{16} + 98y^{17} + 97y^{18} + 96y^{19} + 80y^{20} + 81y^{21} + {} \\
 & 82y^{22} + 86y^{23} + 85y^{24} + 84y^{25} + 88y^{26} + 72y^{27} + 68y^{28} + 69y^{29} + 70y^{30} + \ldots
\end{aligned}$$ (38.12-11)

The coefficients correspond to the Pell representations interpreted as binary numbers, each Pell-digit
occupying two bits (figure 38.12-D).
If relation 38.12-10c is changed to $b_n = P_n$ (indicated in the code, the function can be defined as
pell(k)=if(k<=1, 1, return(2*pell(k-1)+pell(k-2)));), then the function $G_P(y)$ which has the
Pell Gray code sequence as coefficients is computed:

$$\begin{aligned}
G_P(y) = {} & 0 + 1y + 2y^2 + 5y^3 + 4y^4 + 3y^5 + 6y^6 + 13y^7 + 10y^8 + 11y^9 + 12y^{10} + 9y^{11} + {} \\
 & 8y^{12} + 7y^{13} + 14y^{14} + 15y^{15} + 16y^{16} + 33y^{17} + 32y^{18} + 31y^{19} + 24y^{20} + 25y^{21} + {} \\
 & 26y^{22} + 29y^{23} + 28y^{24} + 27y^{25} + 30y^{26} + 23y^{27} + 20y^{28} + 21y^{29} + 22y^{30} + 19y^{31} + \ldots
\end{aligned}$$ (38.12-12)


Section 14.6 on page 313 gives a recursive algorithm to compute the words of the Pell Gray code.
Define the Pell Gray code constant as

$$ G_P = G_P\!\left(\tfrac{1}{2}\right) $$ (38.12-13a)
       = 2.245567348365072195720956572438998819867495229140192012... (38.12-13b)
[base 2] = 10.00111110110111011000000001110010001100011000010000111... (38.12-13c)
[CF] = [2, 4, 13, 1, 5, 1, 1, 1, 27, 1, 9, 1, 3, 8, 1, 2, 1, 1, 3, 14, 1, 8, 1, 1, 6, 3, 1, ...]

Setting $b_n = 1$ and $c_n = 2$ in the algorithm gives a function whose series coefficients are the sum of Pell
Gray code digits. The coefficients coincide with the start of the decimal expansion of $G_{[1,2]}(1/10)$ (until the
first count > 9 appears):

$$ G_{[1,2]}\!\left(\tfrac{1}{10}\right) = 0.1232123234321234543234543432343212345434545654323454345\ldots $$ (38.12-14)

Set $b_n = 1$ and $c_n = 0$ to count the ones in the Pell Gray code:

$$ G_{[1,0]}\!\left(\tfrac{1}{10}\right) = 0.1012101232121010121232343212321210101210123212123234323\ldots $$ (38.12-15)

With $b_n = 0$ and $c_n = 1$ we count the twos:

$$ G_{[0,1]}\!\left(\tfrac{1}{10}\right) = 0.01100110011001122110011001100110011221122112211001100110\ldots $$ (38.12-16)


Part V 


Algorithms for finite fields


Chapter 39


Modular arithmetic and some number theory


We implement the arithmetical operations modulo m, such as addition, subtraction, multiplication, and 
division. Basic concepts of number theory, like the order of an element, quadratic residues, and primitive 
roots are developed. Selected algorithms such as the Rabin-Miller compositeness test and several primality 
tests are presented. Finally we give the Cayley-Dickson construction for hypercomplex numbers and 
compute their multiplication tables. 


Modular arithmetic and the concepts of number theory are fundamental to many areas like cryptography, 
error correcting codes, and digital signal processing. 


39.1 Implementation of the arithmetic operations 


We implement the basic operations of modular arithmetic: addition, subtraction, multiplication,
powering, inversion and division. The representatives modulo $m$ are chosen to be $0, 1, \ldots, m-1$.


39.1.1 Addition and subtraction 
Addition and subtraction modulo m can easily be implemented as [FXT: mod/modarith.h]:


inline umod_t sub_mod(umod_t a, umod_t b, umod_t m)
{
    if ( a>=b )  return  a - b;
    else         return  m - b + a;
}

inline umod_t add_mod(umod_t a, umod_t b, umod_t m)
{
    if ( 0==b )  return a;
    // return sub_mod(a, m-b, m);
    b = m - b;
    if ( a>=b )  return  a - b;
    else         return  m - b + a;
}


The type umod_t is an unsigned 64-bit integer. Care has been taken to avoid any overflow of intermediate 
results. The routines for assignment, increment, decrement and negation are: 


inline umod_t incr_mod(umod_t a, umod_t m) 
{ a++; if ( a==m ) a= 0; return a; } 


inline umod_t decr_mod(umod_t a, umod_t m) 
{ if ( a==0 ) a=m- 1; else a--; return a; } 


inline umod_t neg_mod(umod_t b, umod_t m)
{ if ( 0==b ) return 0; else return m - b; }
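
The remark about overflow can be illustrated with a small comparison. The following is only a sketch: the typedef stands in for the FXT type, and `add_mod_naive` is a hypothetical counterexample that is not part of the library. For moduli close to the word size the naive sum wraps around modulo 2^64 and gives a wrong residue, while the subtraction-based form stays within range.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t umod_t;  // assumption: unsigned 64-bit integer, as stated in the text

// Addition modulo m without overflow, written as in the text.
inline umod_t add_mod(umod_t a, umod_t b, umod_t m)
{
    if ( 0 == b )  return a;
    b = m - b;                    // now compute a - b (mod m)
    if ( a >= b )  return a - b;
    else           return m - b + a;
}

// Naive version for comparison only: a + b may wrap around modulo 2^64.
inline umod_t add_mod_naive(umod_t a, umod_t b, umod_t m)
{
    return (a + b) % m;
}
```

With m = 2^64 - 59 and a = b = m - 1 the correct sum is m - 2; the naive routine returns a different value because a + b exceeds the word size.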


The addition tables for the moduli 13 and 9 are shown in figure 39.1-A.

      0  1  2  3  4  5  6  7  8  9 10 11 12            0  1  2  3  4  5  6  7  8
 0    0  1  2  3  4  5  6  7  8  9 10 11 12       0    0  1  2  3  4  5  6  7  8
 1    1  2  3  4  5  6  7  8  9 10 11 12  0       1    1  2  3  4  5  6  7  8  0
 2    2  3  4  5  6  7  8  9 10 11 12  0  1       2    2  3  4  5  6  7  8  0  1
 3    3  4  5  6  7  8  9 10 11 12  0  1  2       3    3  4  5  6  7  8  0  1  2
 4    4  5  6  7  8  9 10 11 12  0  1  2  3       4    4  5  6  7  8  0  1  2  3
 5    5  6  7  8  9 10 11 12  0  1  2  3  4       5    5  6  7  8  0  1  2  3  4
 6    6  7  8  9 10 11 12  0  1  2  3  4  5       6    6  7  8  0  1  2  3  4  5
 7    7  8  9 10 11 12  0  1  2  3  4  5  6       7    7  8  0  1  2  3  4  5  6
 8    8  9 10 11 12  0  1  2  3  4  5  6  7       8    8  0  1  2  3  4  5  6  7
 9    9 10 11 12  0  1  2  3  4  5  6  7  8
10   10 11 12  0  1  2  3  4  5  6  7  8  9
11   11 12  0  1  2  3  4  5  6  7  8  9 10
12   12  0  1  2  3  4  5  6  7  8  9 10 11


Figure 39.1-A: Addition modulo 13 (left) and modulo 9 (right).


39.1.2 Multiplication 
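Multiplication is a bit harder: with something like

inline umod_t mul_mod(umod_t a, umod_t b, umod_t m)
{
    return (a * b) % m;
}

the modulus would be restricted to half of the word size. Almost all bits can be used for the modulus with
the following trick. Let $\langle x \rangle_y$ denote $x$ modulo $y$, let $\lfloor x \rfloor$ denote the integer part of $x$. For $0 \le a, b < m$
we have

$$ a \cdot b = \left\lfloor \frac{a \cdot b}{m} \right\rfloor \cdot m + \langle a \cdot b \rangle_m $$ (39.1-1)

Rearranging and taking both sides modulo $z > m$ (where $z = 2^k$ on a $k$-bit machine):

$$ \left\langle a \cdot b - \left\lfloor \frac{a \cdot b}{m} \right\rfloor \cdot m \right\rangle_z = \left\langle \langle a \cdot b \rangle_m \right\rangle_z $$ (39.1-2)

The right side equals $\langle a \cdot b \rangle_m$ because $m < z$.

$$ \langle a \cdot b \rangle_m = \left\langle \langle a \cdot b \rangle_z - \left\langle \left\lfloor \frac{a \cdot b}{m} \right\rfloor \cdot m \right\rangle_z \right\rangle_z $$ (39.1-3)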


The expression on the right can be translated into a few lines of C-code. For the following implementation
we require 64-bit integer types int64 (signed) and uint64 (unsigned) and a floating-point type with 64-bit
mantissa, float64 (typically long double).


uint64 mul_mod(uint64 a, uint64 b, uint64 m)
{
    uint64 y = (uint64)((float64)a*(float64)b/m+(float64)1/2);  // floor(a*b/m)
    y = y * m;                 // m*floor(a*b/m) mod z
    uint64 x = a * b;          // a*b mod z
    uint64 r = x - y;          // a*b mod z - m*floor(a*b/m) mod z
    if ( (int64)r < 0 )        // normalization needed ?
    {
        r = r + m;
        y = y - 1;             // (a*b)/m quotient, omit line if not needed
    }
    return r;                  // (a*b)%m residue
}


The technique uses the fact that integer multiplication computes the least significant bits of the result
$\langle a \cdot b \rangle_z$ whereas float multiplication computes the most significant bits of the result. The above routine
works if $0 \le a, b < m < 2^{63} = z/2$. The normalization is not necessary if $m < 2^{62} = z/4$.

When working with a fixed modulus the division by $m$ may be replaced by a multiplication with the
inverse modulus, which only needs to be computed once:


precompute:  float64 i = (float64)1/m;
and replace the line  uint64 y = (uint64)((float64)a*(float64)b/m+(float64)1/2);
by  uint64 y = (uint64)((float64)a*(float64)b*i+(float64)1/2);


so any division inside the routine is avoided. Beware that the routine cannot be used for $m \ge 2^{62}$: it very
rarely fails for moduli of more than 62 bits, due to the additional error when inverting and multiplying
as compared to dividing alone. An implementation is [FXT: mod/modarith.h]:


inline umod_t mul_mod(umod_t a, umod_t b, umod_t m)
{
    umod_t x = a * b;
    umod_t y = m * (umod_t)( (ldouble)a * (ldouble)b/m + (ldouble)1/2 );
    umod_t r = x - y;
    if ( (smod_t)r < 0 )  r += m;
    return r;
}

      0  1  2  3  4  5  6  7  8  9 10 11 12            0  1  2  3  4  5  6  7  8
 0    0  0  0  0  0  0  0  0  0  0  0  0  0       0    0  0  0  0  0  0  0  0  0
 1    0  1  2  3  4  5  6  7  8  9 10 11 12       1    0  1  2  3  4  5  6  7  8
 2    0  2  4  6  8 10 12  1  3  5  7  9 11       2    0  2  4  6  8  1  3  5  7
 3    0  3  6  9 12  2  5  8 11  1  4  7 10       3    0  3  6  0  3  6  0  3  6
 4    0  4  8 12  3  7 11  2  6 10  1  5  9       4    0  4  8  3  7  2  6  1  5
 5    0  5 10  2  7 12  4  9  1  6 11  3  8       5    0  5  1  6  2  7  3  8  4
 6    0  6 12  5 11  4 10  3  9  2  8  1  7       6    0  6  3  0  6  3  0  6  3
 7    0  7  1  8  2  9  3 10  4 11  5 12  6       7    0  7  5  3  1  8  6  4  2
 8    0  8  3 11  6  1  9  4 12  7  2 10  5       8    0  8  7  6  5  4  3  2  1
 9    0  9  5  1 10  6  2 11  7  3 12  8  4
10    0 10  7  4  1 11  8  5  2 12  9  6  3
11    0 11  9  7  5  3  1 12 10  8  6  4  2
12    0 12 11 10  9  8  7  6  5  4  3  2  1


Figure 39.1-B: Multiplication modulo 13 (left) and modulo 9 (right).


Two multiplication tables for the moduli 13 and 9 are shown in figure 39.1-B. Note that for the modulus 9
some products $a \cdot b$ are zero though neither $a$ nor $b$ is zero. The tables were computed with the program
[FXT: mod/modarithtables-demo.cc].


For alternative multiplication (and reduction) techniques see ch.14]. One method of great practical 
importance is the Montgomery multiplication described in [252]. 
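

The interplay of the integer part and the float part in relation 39.1-3 can be checked on a scaled-down "machine". The following sketch uses 16-bit words (z = 2^16) and the type double for the quotient estimate; the routine `mul_mod16` and this choice of word size are assumptions of the example and do not appear in FXT. The restriction m < 2^15 = z/2 mirrors the condition m < 2^63 of the 64-bit routine.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of relation (39.1-3) with 16-bit words, that is, z = 2^16.
// Hypothetical helper, not part of FXT.  Requires 0 <= a, b < m < 2^15.
inline uint16_t mul_mod16(uint16_t a, uint16_t b, uint16_t m)
{
    assert( m != 0 && m < (1u << 15) && a < m && b < m );
    // Float part: rounded quotient, equals floor(a*b/m) or floor(a*b/m)+1.
    uint32_t q = (uint32_t)( (double)a * (double)b / (double)m + 0.5 );
    uint16_t y = (uint16_t)( q * m );                      // m*q mod z
    uint16_t x = (uint16_t)( (uint32_t)a * (uint32_t)b );  // a*b mod z
    uint16_t r = (uint16_t)( x - y );                      // difference mod z
    if ( r & 0x8000u )  r = (uint16_t)( r + m );           // quotient was one too large
    return r;
}
```

Because all quantities are small, the result can be compared against the direct computation `(a*b) % m` in 32-bit arithmetic, which is not possible for the full-word routine.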


39.1.3 Exponentiation 


The algorithm used for exponentiation (powering) is the binary exponentiation algorithm shown in
section 28.5 on page 563:


inline umod_t pow_mod(umod_t a, umod_t e, umod_t m)
// Right-to-left scan
{
    if ( 0==e )  { return 1; }
    else
    {
        umod_t z = a;
        umod_t y = 1;

        while ( 1 )
        {
            if ( e&1 )  y = mul_mod(y, z, m);  // y *= z;
            e >>= 1;
            if ( 0==e )  break;
            z = sqr_mod(z, m);  // z *= z;
        }
        return y;
    }
}


39.1.4 Division and modular inversion 


Subtraction is the inverse of addition. To subtract b from a, we add —b := m — b to a. The element —b 
is the additive inverse of b. Every element has an additive inverse. 


Division is the inverse of multiplication. To divide $a$ by $b$, we multiply $a$ by $b^{-1}$, the multiplicative
inverse of $b$. However, not all elements have a multiplicative inverse, only those $b$ that are coprime to
the modulus $m$ (that is, $\gcd(b, m) = 1$). These elements are called invertible modulo $m$, or units. For a
prime modulus all elements except zero are invertible.


The computation of the GCD uses the Euclidean algorithm [FXT: mod/gcd.h]:


template <typename Type>
Type gcd(Type a, Type b)
// Return greatest common divisor of a and b.
{
    if ( a<b )  swap2(a, b);
    if ( b==0 )  return a;
    Type r;
    do
    {
        r = a % b;
        a = b;
        b = r;
    }
    while ( r!=0 );
    return a;
}


A variant of the algorithm that avoids most of the (expensive) computations a%b is called the binary
GCD algorithm [FXT: mod/binarygcd.h]:


template <typename Type>
Type binary_ugcd(Type a, Type b)
// Return greatest common divisor of a and b.
// Version for unsigned types.
{
    if ( a<b )  swap2(a, b);
    if ( b==0 )  return a;

    Type r = a % b;
    a = b;
    b = r;
    if ( b==0 )  return a;

    ulong k = 0;
    while ( !((a|b)&1) )  // both even
    {
        k++;
        a >>= 1;
        b >>= 1;
    }

    while ( !(a&1) )  a >>= 1;
    while ( !(b&1) )  b >>= 1;

    while ( 1 )
    {
        if ( a==b )  return a << k;

        if ( a<b )  swap2(a, b);
        Type t = (a-b) >> 1;  // t>0

        while ( !(t&1) )  t >>= 1;
        a = t;
    }
}

The complexity of this algorithm for $N$-bit numbers is $O(N^2)$; an $O(N \log(N))$ algorithm is given in [322].


The least common multiple (LCM) of two numbers is 


$$ \mathrm{lcm}(a, b) = \frac{a \cdot b}{\gcd(a, b)} = \left( \frac{a}{\gcd(a, b)} \right) \cdot b $$ (39.1-4)


The latter form avoids overflow when using integer types of fixed size. 
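

As a minimal sketch of the remark about overflow, the routine below divides by the GCD before multiplying. The names `gcd_u64` and `lcm_u64` are chosen for this example only; the GCD is the plain Euclidean loop from above.

```cpp
#include <cassert>
#include <cstdint>

// Euclidean algorithm for unsigned 64-bit values (helper for this sketch).
inline uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while ( b != 0 )  { uint64_t r = a % b;  a = b;  b = r; }
    return a;
}

// lcm via the second form of relation (39.1-4): divide first, then multiply.
// The result itself must still fit into 64 bits.
inline uint64_t lcm_u64(uint64_t a, uint64_t b)
{
    if ( a == 0 || b == 0 )  return 0;
    return ( a / gcd_u64(a, b) ) * b;
}
```

For a = 2^40 and b = 3·2^40 the product a·b does not fit into 64 bits, while the second form yields the correct value 3·2^40.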


For modular inversion we can use the extended Euclidean algorithm to compute the extended GCD
(EGCD), which for two integers $a$ and $b$ finds $d = \gcd(a,b)$ and $u$, $v$ such that $a u + b v = d$. The
following code implements the EGCD algorithm as given in alg.X, p.342]:


template <typename Type>
Type egcd(Type u, Type v, Type &tu1, Type &tu2)
// Return u3 and set tu1,tu2 so that
//   gcd(u,v) == u3 == u*tu1 + v*tu2
// Type must be a signed type.
{
    Type u1 = 1,  u2 = 0;
    Type v1 = 0,  v3 = v;
    Type u3 = u,  v2 = 1;

    while ( v3!=0 )
    {
        Type q = u3 / v3;

        Type t1 = u1 - v1 * q;
        u1 = v1;  v1 = t1;

        Type t3 = u3 - v3 * q;
        u3 = v3;  v3 = t3;

        Type t2 = u2 - v2 * q;
        u2 = v2;  v2 = t2;
    }
    tu1 = u1;  tu2 = u2;
    return u3;
}


To invert $b$ modulo $m$, we must have $\gcd(b,m) = 1$. With the EGCD of $b$ and $m$ we compute $u$ and $v$
such that $m u + b v = 1$. Reduce modulo $m$ to see that $b v \equiv 1 \pmod{m}$. That is, $v$ is the inverse of $b$
modulo $m$ and $a/b := a\, b^{-1} = a\, v$.
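

A short sketch of inversion via the EGCD, for signed 64-bit values. The names `egcd64` and `inv_mod64` are hypothetical and only serve this example; the convention of returning 0 for non-invertible arguments is likewise an assumption, not the library's interface.

```cpp
#include <cassert>
#include <cstdint>

// Extended GCD: returns d = gcd(a,b) and sets u, v with a*u + b*v == d.
inline int64_t egcd64(int64_t a, int64_t b, int64_t &u, int64_t &v)
{
    int64_t u1 = 1, u2 = 0, u3 = a;
    int64_t v1 = 0, v2 = 1, v3 = b;
    while ( v3 != 0 )
    {
        int64_t q = u3 / v3;
        int64_t t;
        t = u1 - q * v1;  u1 = v1;  v1 = t;
        t = u2 - q * v2;  u2 = v2;  v2 = t;
        t = u3 - q * v3;  u3 = v3;  v3 = t;
    }
    u = u1;  v = u2;
    return u3;
}

// Inverse of b modulo m (0 <= b < m, m >= 2),
// or 0 if b is not invertible, that is, gcd(b,m) != 1.
inline int64_t inv_mod64(int64_t b, int64_t m)
{
    int64_t u, v;
    if ( egcd64(m, b, u, v) != 1 )  return 0;  // m*u + b*v == 1
    v %= m;
    if ( v < 0 )  v += m;   // representative in 0, 1, ..., m-1
    return v;
}
```

For example, modulo 13 the inverse of 3 is 9, so the quotient 5/3 is 5·9 = 45 ≡ 6, and indeed 6·3 = 18 ≡ 5.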


Another algorithm for the computation of the modular inversion uses exponentiation. It is given only
after the concept of the order of an element has been introduced (section 39.5 on page 774).


39.2 Modular reduction with structured primes


The modular reduction with Mersenne primes $M = 2^k - 1$ is especially easy: let $u$ and $v$ be in the
range $0 \le u, v < M = 2^k - 1$, then with the non-reduced product written as $u v = 2^k r + s$ (where
$0 \le r, s \le M = 2^k - 1$) the reduction is simply $u v \equiv r + s \pmod{M}$.
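

A minimal sketch of this reduction for k = 31, that is, M = 2^31 − 1. The choice of k and the name `mul_mod_m31` are assumptions of the example. Only a shift, a mask, an addition and one conditional subtraction are needed after the multiplication.

```cpp
#include <cassert>
#include <cstdint>

// Multiplication modulo the Mersenne prime M = 2^31 - 1.
inline uint32_t mul_mod_m31(uint32_t u, uint32_t v)
{
    const uint32_t M = (1u << 31) - 1;
    assert( u < M && v < M );
    uint64_t p = (uint64_t)u * v;   // p = 2^31 * r + s
    uint64_t r = p >> 31;           // high part
    uint64_t s = p & M;             // low part (the low 31 bits)
    uint64_t t = r + s;             // congruent to p modulo M
    if ( t >= M )  t -= M;          // r + s < 2*M, one subtraction suffices
    return (uint32_t)t;
}
```

The result can be compared with the direct computation `((uint64_t)u * v) % M`, which uses a division.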


A modular reduction algorithm that uses only shifts, additions and subtractions can also be found for 
structured primes (called generalized Mersenne primes in [315]). Let the modulus M be of the form 


$$ M = \sum_{i=0}^{n} m_i\, x^i $$ (39.2-1)

where $x = 2^k$ and $m_n = 1$. We further assume that $m_i = \pm 1$ and $m_{n-1} = -1$ (so that the numbers fit
into $n \cdot k$ bits). The reduction algorithm can be found using polynomial arithmetic.


 2^64-2^32+1     == 2^(32*2)-2^(32*1)+1         2^80-2^48+1     == 2^(16*5)-2^(16*3)+1
 2^96-2^32+1     == 2^(32*3)-2^(32*1)+1         2^176-2^48+1    == 2^(16*11)-2^(16*3)+1
 2^224-2^96+1    == 2^(32*7)-2^(32*3)+1         2^176-2^80+1    == 2^(16*11)-2^(16*5)+1
 2^320-2^288+1   == 2^(32*10)-2^(32*9)+1        2^368-2^336+1   == 2^(16*23)-2^(16*21)+1
 2^512-2^32+1    == 2^(32*16)-2^(32*1)+1        2^384-2^80+1    == 2^(16*24)-2^(16*5)+1
 2^512-2^288+1   == 2^(32*16)-2^(32*9)+1        2^400-2^160+1   == 2^(16*25)-2^(16*10)+1
 2^544-2^32+1    == 2^(32*17)-2^(32*1)+1        2^528-2^336+1   == 2^(16*33)-2^(16*21)+1
 2^544-2^96+1    == 2^(32*17)-2^(32*3)+1        2^544-2^304+1   == 2^(16*34)-2^(16*19)+1
 2^576-2^512+1   == 2^(64*9)-2^(64*8)+1         2^560-2^112+1   == 2^(16*35)-2^(16*7)+1
 2^672-2^192+1   == 2^(32*21)-2^(32*6)+1        2^576-2^240+1   == 2^(16*36)-2^(16*15)+1
 2^832-2^448+1   == 2^(64*13)-2^(64*7)+1        2^672-2^560+1   == 2^(16*42)-2^(16*35)+1
 2^992-2^832+1   == 2^(32*31)-2^(32*26)+1       2^688-2^96+1    == 2^(16*43)-2^(16*6)+1
 2^1088-2^608+1  == 2^(32*34)-2^(32*19)+1       2^784-2^48+1    == 2^(16*49)-2^(16*3)+1
 2^1184-2^768+1  == 2^(32*37)-2^(32*24)+1       2^832-2^432+1   == 2^(16*52)-2^(16*27)+1
 2^1376-2^32+1   == 2^(32*43)-2^(32*1)+1        2^880-2^368+1   == 2^(16*55)-2^(16*23)+1
 2^1664-2^256+1  == 2^(128*13)-2^(128*2)+1      2^912-2^32+1    == 2^(16*57)-2^(16*2)+1
 2^1856-2^1056+1 == 2^(32*58)-2^(32*33)+1       2^944-2^784+1   == 2^(16*59)-2^(16*49)+1
 2^1920-2^384+1  == 2^(128*15)-2^(128*3)+1      2^1008-2^144+1  == 2^(16*63)-2^(16*9)+1
 2^1984-2^544+1  == 2^(32*62)-2^(32*17)+1       2^1024-2^880+1  == 2^(16*64)-2^(16*55)+1


Figure 39.2-A: The complete list of primes of the form $p = x^k - x^j + 1$ where $x = 2^G$, $G = 2^i$, $G \ge 32$
and $p$ up to 2048 bits (left) and the equivalent list for $x = 2^{16}$ and $p$ up to 1024 bits (right).


? M=x^20-x^15+x^10-x^5+1;
? n=poldegree(M);
? P=sum(i=0,2*n-1,eval(Str("p_"i))*x^i)
p_39*x^39 + p_38*x^38 + [--snip--] + p_3*x^3 + p_2*x^2 + p_1*x + p_0
? R=P%M;
? for(i=0,n-1,print(" ",eval(Str("r_"i))," = ",polcoeff(R,i)))
 r_0 = p_0 + (-p_20 - p_25)
 r_1 = p_1 + (-p_21 - p_26)
 r_2 = p_2 + (-p_22 - p_27)
 r_3 = p_3 + (-p_23 - p_28)
 r_4 = p_4 + (-p_24 - p_29)
 r_5 = p_5 + (p_20 - p_30)
 r_6 = p_6 + (p_21 - p_31)
 r_7 = p_7 + (p_22 - p_32)
 r_8 = p_8 + (p_23 - p_33)
 r_9 = p_9 + (p_24 - p_34)
 r_10 = p_10 + (-p_20 - p_35)
 r_11 = p_11 + (-p_21 - p_36)
 r_12 = p_12 + (-p_22 - p_37)
 r_13 = p_13 + (-p_23 - p_38)
 r_14 = p_14 + (-p_24 - p_39)
 r_15 = p_15 + p_20
 r_16 = p_16 + p_21
 r_17 = p_17 + p_22
 r_18 = p_18 + p_23
 r_19 = p_19 + p_24


Figure 39.2-B: Computation of the reduction rule for the 640-bit prime $Y_{50}(2^{32})$.


Write the non-reduced product $P$ as

$$ P = \sum_{i=0}^{2n-1} p_i\, x^i $$ (39.2-2)

and the reduced product $R$ as

$$ R = \sum_{i=0}^{n-1} r_i\, x^i :\equiv P \pmod{M} $$ (39.2-3)

where $0 \le r_i < x$. We determine the reduction rules for moduli of the form $x^k - x^j + 1$ (for $k = 3$ and
$j = 2$, the rules are the last three lines):


? k=3;j=2;
? M=x^k-x^j+1
x^3 - x^2 + 1
? n=poldegree(M);
? P=sum(i=0,2*n-1,eval(Str("p_"i))*x^i) \\ unreduced product
p_5*x^5 + p_4*x^4 + p_3*x^3 + p_2*x^2 + p_1*x + p_0
? R=P%M; \\ reduced product
? for(i=0,n-1,print(" ",eval(Str("r_"i))," = ",polcoeff(R,i)))
 r_0 = p_0 + (-p_3 + (-p_4 - p_5))
 r_1 = p_1 + (-p_4 - p_5)
 r_2 = p_2 + (p_3 + p_4)


A list of primes of the form $p = x^k - x^j + 1$ where $x = 2^G$, $G$ a power of 2 and $G \ge 16$ is shown in figure
39.2-A. The equivalent list with $i$ a multiple of 8 is given in [FXT: data/structured-primes-2k2jl.txt]. The
primes allow radix-2 number theoretic transforms up to a length of $2^j$.

Structured primes that are evaluations of cyclotomic polynomials are given in section 39.11.4.7 on
page 802. The reduction rule for the 640-bit prime $M = Y_{50}(2^{32})$ is shown in figure 39.2-B. There
is a choice for the 'granularity' of the rule: the modulus also equals $Y_{10}(2^{160})$, so we can obtain the
reduction rule for groups of five 32-bit words:


? M=x^4-x^3+x^2-x^1+1;
[--snip--]
? for(i=0,n-1,print(" ",eval(Str("r_"i))," = ",polcoeff(R,i)))
 r_0 = p_0 + (-p_4 - p_5)
 r_1 = p_1 + (p_4 - p_6)
 r_2 = p_2 + (-p_4 - p_7)
 r_3 = p_3 + p_4


The rule in terms of single words seems to be more appropriate as it allows for easier code generation. 


39.3 The sieve of Eratosthenes 


Several number theoretic algorithms can take advantage of a precomputed list of primes. A simple and
quite efficient algorithm, called the sieve of Eratosthenes, computes all primes up to a given limit. It uses
a tag-array where all entries $\ge 2$ are initially marked as potential primes. The algorithm proceeds by
searching for the next marked entry and removing all multiples of it.


An implementation that uses the bitarray class (see section 4.6 on page 164) is given in [FXT:
mod/eratosthenes-demo.cc]:


void
eratosthenes(bitarray &ba)
{
    ba.set_all();
    ba.clear(0);
    ba.clear(1);
    ulong n = ba.n_;
    ulong k = 0;
    while ( (k=ba.next_set(k+1)) < n )
    {
        for (ulong j=2, i=j*k; i<n; ++j, i=j*k)  ba.clear(i);
    }
}

The program prints the resulting list of primes (code slightly simplified):

int
main(int argc, char **argv)
{
    ulong n = 100;
    bitarray ba(n);
    eratosthenes(ba);
    ulong k = 0;
    ulong ct = 0;
    while ( (k=ba.next_set(k+1)) < n )
    {
        ++ct;
        cout << " " << k;
    }
    cout << endl;
    cout << "Found " << ct << " primes below " << n << "." << endl;
    return 0;
}

The output is:

 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
Found 25 primes below 100.

A little thought leads to a faster variant: when removing the multiples $k \cdot p$ of the prime $p$ from the list we
only need to care about the values of $k$ that are greater than all primes found so far. Further, values $k \cdot p$
containing only prime factors less than $p$ have already been removed. That is, we only need to remove the
values $\{p^2,\ p^2 + p,\ p^2 + 2p,\ p^2 + 3p,\ \ldots\}$. This algorithmic improvement can be deduced from the series
acceleration of the Lambert series given as relation 37.1-27. If we further extract the loop for
the prime 2, then for the odd primes, we need to remove only the values $\{p^2,\ p^2 + 2p,\ p^2 + 4p,\ p^2 + 6p,\ \ldots\}$.


The implementation is [FXT: mod/eratosthenes-demo.cc]:


void
eratosthenes_opt(bitarray &ba)
{
    ba.set_all();
    ba.clear(0);
    ba.clear(1);
    ulong n = ba.n_;
    for (ulong k=4; k<n; k+=2)  ba.clear(k);
    ulong r = isqrt(n);
    ulong k = 0;
    while ( (k=ba.next_set(k+1)) < n )
    {
        if ( k>r )  break;
        for (ulong j=k*k; j<n; j+=k*2)  ba.clear(j);
    }
}


When computing the primes up to a limit $N$, about $N/p$ values are removed after finding the prime $p$. If
we slightly overestimate the computational work $W$ by

$$ W \approx N \sum_{p < N,\ p\ \mathrm{prime}} \frac{1}{p} $$ (39.3-1)

then we have $W \approx N \log(\log(N))$, which is almost linear. Practically, much of the time used with greater
values of $N$ is spent waiting for memory access. Therefore further improvements should rather address
machine-specific optimizations than additional algorithmic refinements.
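

For readers without the FXT library at hand, the optimized sieve can be sketched with the standard library alone. Here `std::vector<bool>` stands in for the bitarray class and `sieve_below` is a name chosen for this example; both are assumptions, not FXT interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sieve of Eratosthenes with the two refinements described above:
// even numbers are struck out in a separate loop, and for an odd prime p
// only p*p, p*p+2p, p*p+4p, ... are removed.
// Returns a table t with t[k] == true exactly if k is prime (0 <= k < n).
inline std::vector<bool> sieve_below(std::size_t n)
{
    std::vector<bool> is_prime(n, true);
    if ( n > 0 )  is_prime[0] = false;
    if ( n > 1 )  is_prime[1] = false;
    for (std::size_t k = 4; k < n; k += 2)  is_prime[k] = false;
    for (std::size_t p = 3; p * p < n; p += 2)
    {
        if ( !is_prime[p] )  continue;   // p is composite, already handled
        for (std::size_t j = p * p; j < n; j += 2 * p)  is_prime[j] = false;
    }
    return is_prime;
}
```

Counting the set entries of `sieve_below(100)` reproduces the 25 primes below 100 listed above.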


We can save half of the space by recording only the odd primes. A C++ implementation of the modified 
algorithm is [FXT: 


bitarray *
make_oddprime_bitarray(ulong n, bitarray *ba/*=0*/)
{
    if ( 0!=ba )  delete ba;
    ba = new bitarray( n/2 );
    ba->set_all();
    const ulong m = ba->n_;

    ba->clear(0);
    ulong r = isqrt(n);
    ulong i = 3;
    ulong ih = i/2;
    while ( i <= r )
    {
        if ( ba->test( ih ) )
        {
            for (ulong kh=(i*i)/2; kh<m; kh+=i)  ba->clear( kh );
        }
        ih = ba->next_set( ih+1 );
        i = 2*ih + 1;
    }
    return ba;
}


The corresponding table is created at the startup of programs linking the FXT-library. Now we can verify 
the primality of small numbers [FXT: : 

bool
is_small_prime(ulong n, const bitarray *ba/*=0*/)
// Return whether n is a small prime.
// Return false if table of primes is not big enough.
{
    if ( 0==(n&1) )  return (2==n);  // n even: 2 is prime, else composite
    if ( n<=1 )  return 0;  // zero or one

    if ( 0==ba )  ba = oddprime_bitarray;
    ulong nh = n/2;
    if ( nh >= ba->n_ )  return false;
    return ba->test( nh );
}

The data can also be used to compute the next prime greater than or equal to a given value:

ulong
next_small_prime(ulong n, const bitarray *ba/*=0*/)
// Return next prime >= n.
// Return zero if table of primes is not big enough.
{
    if ( n<=2 )  return 2;

    if ( 0==ba )  ba = oddprime_bitarray;
    ulong nh = n/2;

    n = ba->next_set( nh );
    if ( n==(ba->n_) )  return 0;
    return 2 * n + 1;
}


39.4 The Chinese Remainder Theorem (CRT) 


Let $m_1, m_2, \ldots, m_f$ be pair-wise coprime (that is, $\gcd(m_i, m_j) = 1$ for all $i \ne j$). If $x \equiv x_i \pmod{m_i}$ for
$i = 1, 2, \ldots, f$ then $x$ is unique modulo the product $M = m_1 \cdot m_2 \cdots m_f$. This is the Chinese remainder
theorem (CRT). Note that it is not assumed that any of the $m_i$ is prime.


The theorem tells us that a computation modulo a composite number $M$ can be split into separate
computations modulo the coprime factors of $M$. To evaluate a function $y := F(x) \bmod M$ where
$M = m_1 \cdot m_2 \cdots m_f$ (with $\gcd(m_i, m_j) = 1$ for all $i \ne j$), proceed as follows:

1. Splitting: compute $x_1 = x \bmod m_1$, $x_2 = x \bmod m_2$, ..., $x_f = x \bmod m_f$.
2. Separate computations: compute $y_1 := F(x_1) \bmod m_1$, $y_2 := F(x_2) \bmod m_2$, ...,
   $y_f := F(x_f) \bmod m_f$.
3. Recombination: compute $y$ from $y_1, y_2, \ldots, y_f$ using the CRT.


For example, when computing the exact convolution of a long sequence via number theoretic transforms
(see section 26.3 on page 542) the largest term of the result must be less than the modulus. Assume that
(efficient) modular arithmetic is available for moduli of at most word size. Now choose several coprime
moduli that fit into a word and whose product $M$ is greater than the largest element of the result.
Compute the transforms separately and only at the very end compute, via the CRT, the result modulo
$M$. Note that only the result needs to be less than $M$, we do not need to worry about intermediate
quantities.


39.4.1 Efficient computation 


For two moduli $m_1$, $m_2$ compute $x$ with $x \equiv x_1 \pmod{m_1}$ and $x \equiv x_2 \pmod{m_2}$ as suggested by the
following pseudocode:

function crt2(x1, m1, x2, m2)
{
    c := m1**(-1) mod m2  // inverse of m1 modulo m2
    s := ((x2-x1)*c) mod m2
    return x1 + s * m1
}

For repeated CRT calculations with the same modulus one should precompute and store $c = m_1^{-1} \bmod m_2$.
With more than two moduli use the above algorithm repeatedly. As pseudocode:


function crt(x[0,...,f-1], m[0,...,f-1], f)
{
    x1 := x[0]
    m1 := m[0]
    i := 1
    do
    {
        x2 := x[i]
        m2 := m[i]
        x1 := crt2(x1, m1, x2, m2)
        m1 := m1 * m2
        i := i + 1
    }
    while i < f
    return x1
}
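
The two-moduli step can be written out in C++ as follows. This is a sketch under simplifying assumptions: the moduli are small (m1·m2 < 2^31) so that no intermediate product overflows, and the helper `inv_mod_small` (extended Euclidean algorithm) is introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Inverse of a modulo m by the extended Euclidean algorithm (a coprime to m).
inline int64_t inv_mod_small(int64_t a, int64_t m)
{
    int64_t r0 = m, r1 = a % m, t0 = 0, t1 = 1;
    while ( r1 != 0 )
    {
        int64_t q = r0 / r1;
        int64_t r2 = r0 - q * r1;  r0 = r1;  r1 = r2;
        int64_t t2 = t0 - q * t1;  t0 = t1;  t1 = t2;
    }
    assert( r0 == 1 );              // a must be invertible modulo m
    return ( t0 < 0 ? t0 + m : t0 );
}

// x with x == x1 (mod m1) and x == x2 (mod m2), 0 <= x < m1*m2.
// Assumes 0 <= x1 < m1, 0 <= x2 < m2, gcd(m1,m2) == 1 and m1*m2 < 2^31.
inline int64_t crt2(int64_t x1, int64_t m1, int64_t x2, int64_t m2)
{
    int64_t c = inv_mod_small(m1, m2);        // c = m1^(-1) mod m2
    int64_t s = ( (x2 - x1) % m2 ) * c % m2;  // may be negative in C++
    if ( s < 0 )  s += m2;
    return x1 + s * m1;
}
```

With the moduli 13 and 9 every x below 117 is recovered from its two residues, for example x ≡ 5 (mod 13) and x ≡ 7 (mod 9) give x = 70.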
A C++ implementation is given in [FXT: mod/chinese.cc]:

umod_t
chinese(const umod_t *x, const factorization &f)
// Return R modulo M where:
//   f[] is the factorization of M,
//   x[] := R modulo the prime powers of f[].
{
    const int n = f.nprimes();
    // (omitted test that gcd(m_0,...,m_{n-1})=1 )

    const umod_t M = f.product();
    umod_t R = 0;
    for (int i=0; i<n; ++i)
    {
        // Ti = prod(mk) (where k!=i);  Ti==M/mi:
        const umod_t Ti = M / f.primepow(i);  // exact division

        // ci = 1/Ti:
        umod_t ci = inv_modpp(Ti, f.prime(i), f.exponent(i));
        // here: 0 <= ci < mi

        // Xi = x[i] * ci * Ti:
        umod_t Xi = ci * Ti;  // 0 <= Xi < M
        Xi = mul_mod(Xi, x[i], M);

        // add Xi to result:
        R = add_mod(R, Xi, M);
    }
    return R;
}


39.4.2 The underlying construction ‡


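We derive the algorithm for CRT recombination from a construction for k coprime moduli. Define $T_i$ as

$$ T_i := \prod_{k \ne i} m_k $$ (39.4-1)

and $c_i$ as

$$ c_i := T_i^{-1} \bmod m_i $$ (39.4-2)

Then for $X_i$ defined as

$$ X_i := x_i\, c_i\, T_i $$ (39.4-3)

one has

$$ X_i \bmod m_j = \begin{cases} x_i & \text{if } j = i \\ 0 & \text{else} \end{cases} $$ (39.4-4)

Therefore

$$ \sum_k X_k \equiv x_i \pmod{m_i} $$ (39.4-5)


For the special case of two moduli $m_1$, $m_2$ one has

$$ T_1 = m_2, \quad T_2 = m_1 $$ (39.4-6a)
$$ c_1 = m_2^{-1} \bmod m_1, \quad c_2 = m_1^{-1} \bmod m_2 $$ (39.4-6b)

The quantities are related by

$$ c_1\, m_2 + c_2\, m_1 \equiv 1 \pmod{m_1 m_2} $$ (39.4-7)

and

$$ \sum_k X_k = x_1\, c_1\, T_1 + x_2\, c_2\, T_2 $$ (39.4-8a)
$$ \phantom{\sum_k X_k} = x_1\, c_1\, m_2 + x_2\, c_2\, m_1 $$ (39.4-8b)
$$ \phantom{\sum_k X_k} \equiv x_1\, (1 - c_2\, m_1) + x_2\, c_2\, m_1 $$ (39.4-8c)
$$ \phantom{\sum_k X_k} = x_1 + (x_2 - x_1)\, (m_1^{-1} \bmod m_2)\, m_1 $$ (39.4-8d)

The last equality is used in the code.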


39.5 The order of an element 


The (multiplicative) order $r = \mathrm{ord}(a)$ of an element $a$ is the smallest positive exponent so that $a^r = 1$.
For elements that are not invertible ($\gcd(a, m) \ne 1$) the order is not defined. Figure 39.5-A shows the
powers of all elements modulo the prime 13. The rightmost column gives the order of those elements that
are invertible.


An element $a$ whose $r$-th power equals 1 is called an $r$-th root of unity: $a^r = 1$. Modulo 9 both elements
2 and 4 are 6th roots of unity, see figure 39.5-B.


If $a^r = 1$ but $a^x \ne 1$ for all positive $x < r$, then $a$ is called a primitive $r$-th root of unity. Modulo 9 the element
2 is a primitive 6th root of unity; the element 4 is not, it is a primitive 3rd root of unity. An element of
order $r$ is an $r$-th primitive root of unity.

      0  1  2  3  4  5  6  7  8  9 10 11 12   <--= exponent
     ---------------------------------------  [order]
 0    1  0  0  0  0  0  0  0  0  0  0  0  0   [--]
 1    1  1  1  1  1  1  1  1  1  1  1  1  1   [ 1]
 2    1  2  4  8  3  6 12 11  9  5 10  7  1   [12]
 3    1  3  9  1  3  9  1  3  9  1  3  9  1   [ 3]
 4    1  4  3 12  9 10  1  4  3 12  9 10  1   [ 6]
 5    1  5 12  8  1  5 12  8  1  5 12  8  1   [ 4]
 6    1  6 10  8  9  2 12  7  3  5  4 11  1   [12]
 7    1  7 10  5  9 11 12  6  3  8  4  2  1   [12]
 8    1  8 12  5  1  8 12  5  1  8 12  5  1   [ 4]
 9    1  9  3  1  9  3  1  9  3  1  9  3  1   [ 3]
10    1 10  9 12  3  4  1 10  9 12  3  4  1   [ 6]
11    1 11  4  5  3  7 12  2  9  8 10  6  1   [12]
12    1 12  1 12  1 12  1 12  1 12  1 12  1   [ 2]

Figure 39.5-A: Powers and orders modulo 13.


      0  1  2  3  4  5  6  7  8   <--= exponent           0  1  2  3  4  5  6   <--= exponent
     ---------------------------  [order]                ---------------------  [order]
 0    1  0  0  0  0  0  0  0  0   [--]               1    1  1  1  1  1  1  1   [ 1]
 1    1  1  1  1  1  1  1  1  1   [ 1]               2    1  2  4  8  7  5  1   [ 6]
 2    1  2  4  8  7  5  1  2  4   [ 6]               4    1  4  7  1  4  7  1   [ 3]
 3    1  3  0  0  0  0  0  0  0   [--]               5    1  5  7  8  4  2  1   [ 6]
 4    1  4  7  1  4  7  1  4  7   [ 3]               7    1  7  4  1  7  4  1   [ 3]
 5    1  5  7  8  4  2  1  5  7   [ 6]               8    1  8  1  8  1  8  1   [ 2]
 6    1  6  0  0  0  0  0  0  0   [--]
 7    1  7  4  1  7  4  1  7  4   [ 3]
 8    1  8  1  8  1  8  1  8  1   [ 2]


Figure 39.5-B: Powers and orders modulo 9 (left), the maximal order is $R(9) = 6 = \varphi(9)$. The order
modulo $m$ is defined only for elements $a$ where $\gcd(a, m) = 1$. The table of powers for the group of units
$(\mathbb{Z}/9\mathbb{Z})^*$ (right) is obtained by dropping all elements for which the order is undefined.
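

The orders in the figures can be recomputed by repeated multiplication. The following brute-force sketch is meant for small moduli only; the name `order_mod` and the convention of returning 0 for an undefined order are assumptions of this example.

```cpp
#include <cassert>
#include <cstdint>

// Multiplicative order of a modulo m by repeated multiplication.
// Returns 0 if a is not invertible modulo m (order undefined).
// Intended for small m only (m*m must fit into 64 bits).
inline uint64_t order_mod(uint64_t a, uint64_t m)
{
    assert( m >= 2 );
    a %= m;
    uint64_t x = a;
    for (uint64_t r = 1; r < m; ++r)
    {
        if ( x == 1 )  return r;    // smallest r > 0 with a^r == 1
        x = x * a % m;
    }
    return 0;   // the powers never reach 1: gcd(a,m) != 1
}
```

The loop may stop at r = m − 1 because the order of an invertible element never exceeds the number of units, which is at most m − 1.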


The maximal order R(m) is simply the maximum of the orders of all elements for a fixed modulus m. 
For prime modulus p the maximal order equals R(p) = p — 1. We omit the argument p of the maximal 
order where it cannot cause confusion. 


An element of maximal order is an $R$-th primitive root of unity. Roots of unity of an order different from
$R$ are available only for the divisors $d_i$ of $R$: if $g$ is an element of maximal order $R$, then $g^{R/d_i}$ has order
$d_i$ (it is a primitive $d_i$-th root of unity):

$$ \mathrm{ord}\left( g^{R/d_i} \right) = d_i $$ (39.5-1)

This is because $(g^{R/d_i})^{d_i} = g^R = 1$ and $(g^{R/d_i})^k \ne 1$ for $0 < k < d_i$.


The factor by which the order of an element falls short of the maximal order is sometimes called the
index of the element. Let $i$ be the index and $r$ the order, then $i \cdot r = R$.


The concept of the order comes from group theory. The invertible elements modulo $m$ with multiplication
form a group: the multiplicative group. The neutral element is 1. The (multiplicative) order defined above
is the order in this group, it tells us how often we have to multiply the element to 1 to get 1. We restrict
orders to positive values, else every element would have order zero.


With addition things are simpler, all elements with addition form a group with 0 as neutral element:
the additive group. The additive order of an element in this group tells us how often we have to add
the element to 0 to get 0. The additive order of the element $a$ modulo $m$ is simply $m / \gcd(a, m)$. All
elements coprime to $m$ (and especially 1 and $-1$) are generators of the additive group.


The maximal order R of all elements of a group is sometimes called the exponent of the group. With 
certain moduli the powers of the elements of maximal order generate the whole multiplicative group. 
Such elements are called primitive elements, primitive roots, or generators of the group. In what follows 
we describe under which conditions the multiplicative group has generators. 


39.6 Prime modulus: the field Z/pZ = F_p = GF(p)


If the modulus is a prime $p$, then $\mathbb{Z}/p\mathbb{Z}$ is the finite field $\mathbb{F}_p = \mathrm{GF}(p)$: all elements except 0 have inverses
and thereby division is possible in $\mathrm{GF}(p)$. The maximal order $R$ equals $p - 1$. Elements of order $R$ are
called primitive roots modulo $p$ or generators modulo $p$.


If $g$ is a generator, then every element in $\mathrm{GF}(p)$ different from 0 is equal to some power $g^e$ ($1 \le e < p$)
of $g$ and its order is $R / \gcd(e, R)$. To test whether $g$ is a primitive root we only need to check whether

$$ g^{(p-1)/q_i} \ne 1 \bmod p $$ (39.6-1)

for all prime divisors $q_i$ of $p - 1$. To find a primitive root, use a simple search:


function primroot(p)
{
    if p==2 then return 1
    f[] := distinct_prime_factors(p-1)
    for r:=2 to p-1
    {
        x := TRUE
        foreach q in f[]
        {
            if r**((p-1)/q)==1 then x:=FALSE
        }
        if x==TRUE then return r
    }
    error("no primitive root found")  // p cannot be prime !
}


In practice the root is found after only a few tries. Note that the factorization of $p - 1$ must be known.
An element of order $n$ in $\mathrm{GF}(p)$ is returned by the following function, $n$ must divide $p - 1$:


function element_of_order(n, p)
{
    R := p-1  // maxorder
    r := primroot(p)
    x := r**(R/n)
    return x
}

39.7 Composite modulus: the ring Z/mZ 


In what follows we will need the function $\varphi$, the totient function (or Euler's totient function). The
function $\varphi(m)$ counts the number of integers coprime to and less than $m$:

$$ \varphi(m) := \sum_{\substack{1 \le k < m \\ \gcd(k,m)=1}} 1 $$ (39.7-1)

The sequence of values $\varphi(n)$ is entry A000010 in [312]. The values of $\varphi(n)$ for $n \le 96$ are shown in
figure 39.7-A. For $m = p$ prime we have

$$ \varphi(p) = p - 1 $$ (39.7-2)

For $m$ composite $\varphi(m)$ is always less than $m - 1$. For $m = p^k$ a prime power we have

$$ \varphi(p^k) = p^k - p^{k-1} = p^{k-1}\, (p - 1) $$ (39.7-3)


 n: φ(n) | n: φ(n) | n: φ(n) | n: φ(n) | n: φ(n) | n: φ(n) | n: φ(n) | n: φ(n)
 1:  1   | 13: 12  | 25: 20  | 37: 36  | 49: 42  | 61: 60  | 73: 72  | 85: 64
 2:  1   | 14:  6  | 26: 12  | 38: 18  | 50: 20  | 62: 30  | 74: 36  | 86: 42
 3:  2   | 15:  8  | 27: 18  | 39: 24  | 51: 32  | 63: 36  | 75: 40  | 87: 56
 4:  2   | 16:  8  | 28: 12  | 40: 16  | 52: 24  | 64: 32  | 76: 36  | 88: 40
 5:  4   | 17: 16  | 29: 28  | 41: 40  | 53: 52  | 65: 48  | 77: 60  | 89: 88
 6:  2   | 18:  6  | 30:  8  | 42: 12  | 54: 18  | 66: 20  | 78: 24  | 90: 24
 7:  6   | 19: 18  | 31: 30  | 43: 42  | 55: 40  | 67: 66  | 79: 78  | 91: 72
 8:  4   | 20:  8  | 32: 16  | 44: 20  | 56: 24  | 68: 32  | 80: 32  | 92: 44
 9:  6   | 21: 12  | 33: 20  | 45: 24  | 57: 36  | 69: 44  | 81: 54  | 93: 60
10:  4   | 22: 10  | 34: 16  | 46: 22  | 58: 28  | 70: 24  | 82: 40  | 94: 46
11: 10   | 23: 22  | 35: 24  | 47: 46  | 59: 58  | 71: 70  | 83: 82  | 95: 72
12:  4   | 24:  8  | 36: 12  | 48: 16  | 60: 16  | 72: 24  | 84: 24  | 96: 32


Figure 39.7-A: Values of $\varphi(n)$, the number of integers less than $n$ and coprime to $n$, for $n \le 96$.


The totient function is a multiplicative function: one has φ(x1 x2) = φ(x1) φ(x2) for coprime x1, x2, that
is, gcd(x1, x2) = 1 (but x1 and x2 are not required to be primes). Thus, if p_i are the distinct primes in
the factorization of n, then


    \varphi(n) = \prod_i \varphi(p_i^{e_i})   where   n = \prod_i p_i^{e_i}   (39.7-4)

An alternative expression for φ(n) is

    \varphi(n) = n \prod_i \left(1 - \frac{1}{p_i}\right)   where   n = \prod_i p_i^{e_i}   (39.7-5)


We note a generalization: the number of s-element sets of numbers < n whose greatest common divisor
is coprime to n equals


    \varphi_s(n) = n^s \prod_i \left(1 - \frac{1}{p_i^s}\right)   where   n = \prod_i p_i^{e_i}   (39.7-6)


Pseudocode to compute φ(m) for arbitrary m:


function euler_phi(m)
{
    { n, p[], x[] } := factorization(m) // m==product(i=0..n-1, p[i]**x[i])
    ph := 1
    for i:=0 to n-1
    {
        k := x[i] // exponent
        ph := ph * (p[i]**(k-1)) * (p[i]-1) // ==ph * euler_phi(p[i]**x[i])
    }
    return ph
}
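A compact C++ sketch of the same computation is given below. It is an illustration only: instead of calling a separate factorization routine it splits off the prime powers by trial division, which is adequate for small m.

```cpp
#include <cstdint>

// Euler's totient of m >= 1, factoring m by trial division.
uint64_t euler_phi(uint64_t m)
{
    uint64_t ph = 1;
    for (uint64_t p=2; p*p<=m; ++p)
    {
        if ( m % p != 0 )  continue;
        // p is a prime factor of m: contribute p**(k-1) * (p-1)
        m /= p;
        ph *= (p - 1);
        while ( m % p == 0 )  // the remaining k-1 factors p
        {
            m /= p;
            ph *= p;
        }
    }
    if ( m > 1 )  ph *= (m - 1);  // one prime factor above the square root is left
    return ph;
}
```

The values agree with figure 39.7-A, for example euler_phi(96) is 32 and euler_phi(87) is 56.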


The multiplicative group consists of the invertible elements (or units) and is denoted by (Z/mZ)*. The
size of the group (Z/mZ)* equals the number of units:


    |(Z/mZ)^*| = \varphi(m)                                                   (39.7-7)


If m factorizes as m = 2^{k_0} · p_1^{k_1} · ... · p_q^{k_q} where p_i are pairwise distinct primes, then


    |(Z/mZ)^*| = \varphi(2^{k_0}) \cdot \varphi(p_1^{k_1}) \cdot \ldots \cdot \varphi(p_q^{k_q})      (39.7-8)


Further, the group (Z/mZ)* is isomorphic to the direct product of the multiplicative groups modulo the
prime powers:


    (Z/mZ)^* \cong (Z/2^{k_0}Z)^* \times (Z/p_1^{k_1}Z)^* \times \cdots \times (Z/p_q^{k_q}Z)^*      (39.7-9)
The relation suggests that we can, instead of working modulo m, do computations modulo all prime
powers in parallel. The Chinese remainder theorem (section 39.4 on page 772) tells us how to find the
element modulo m, given the results modulo the prime powers. The other direction is simply modular
reduction.


39.7.1 Cyclic and noncyclic multiplicative groups 


      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14                 0  1  2  3  4
 0    1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  [--]      1    1  1  1  1  1  [ 1]
 1    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  [ 1]      2    1  2  4  8  1  [ 4]
 2    1  2  4  8  1  2  4  8  1  2  4  8  1  2  4  [ 4]      4    1  4  1  4  1  [ 2]
 3    1  3  9 12  6  3  9 12  6  3  9 12  6  3  9  [--]      7    1  7  4 13  1  [ 4]
 4    1  4  1  4  1  4  1  4  1  4  1  4  1  4  1  [ 2]      8    1  8  4  2  1  [ 4]
 5    1  5 10  5 10  5 10  5 10  5 10  5 10  5 10  [--]     11    1 11  1 11  1  [ 2]
 6    1  6  6  6  6  6  6  6  6  6  6  6  6  6  6  [--]     13    1 13  4  7  1  [ 4]
 7    1  7  4 13  1  7  4 13  1  7  4 13  1  7  4  [ 4]     14    1 14  1 14  1  [ 2]
 8    1  8  4  2  1  8  4  2  1  8  4  2  1  8  4  [ 4]
 9    1  9  6  9  6  9  6  9  6  9  6  9  6  9  6  [--]
10    1 10 10 10 10 10 10 10 10 10 10 10 10 10 10  [--]
11    1 11  1 11  1 11  1 11  1 11  1 11  1 11  1  [ 2]
12    1 12  9  3  6 12  9  3  6 12  9  3  6 12  9  [--]
13    1 13  4  7  1 13  4  7  1 13  4  7  1 13  4  [ 4]
14    1 14  1 14  1 14  1 14  1 14  1 14  1 14  1  [ 2]


Figure 39.7-B: Powers and orders modulo 15 (left). The group (Z/15Z)* is noncyclic: there are φ(15) =
8 invertible elements but no element generates all of them as the maximal order is R(15) = 4 < φ(15).
The table of powers for the group (Z/15Z)* (right) is obtained by dropping all non-invertible elements.


If the maximal order R(m) is equal to |(Z/mZ)*| = φ(m), then the multiplicative group (Z/mZ)* is
called cyclic, else we call it noncyclic. The term cyclic reflects that the powers of any element of maximal
order ‘cycle through’ all elements of (Z/mZ)*. An element of maximal order in a cyclic group is also
called a generator as its powers ‘generate’ all elements.


Figure 39.7-B shows the powers and orders of the noncyclic group (Z/15Z)* where no element generates
all units. The groups (Z/13Z)* and (Z/9Z)* are cyclic, see figure 39.5-A and figure 39.5-B.


For prime modulus m the group (Z/mZ)* contains all nonzero elements and any element of maximal 
order is a generator of the group. 


For m a power p^k of an odd prime p the maximal order R in (Z/mZ)* is


    R(p^k) = \varphi(p^k)                                                     (39.7-10)

For m a power of 2 an irregularity occurs:


    R(2^k) = \begin{cases} 1 & \text{for } k = 1 \\ 2 & \text{for } k = 2 \\ 2^{k-2} & \text{for } k \ge 3 \end{cases}      (39.7-11)


That is, for powers of 2 greater than 4 the maximal order falls short of φ(2^k) = 2^{k-1} by a factor of 2.


For the general modulus m = 2^{k_0} · p_1^{k_1} · ... · p_q^{k_q} the maximal order is


    R(m) = \mathrm{lcm}\left( R(2^{k_0}),\ R(p_1^{k_1}),\ \ldots,\ R(p_q^{k_q}) \right)      (39.7-12)




39.7.1.1 Computation of the maximal order 


The maximal order R(m) of an element in (Z/mZ)* can be computed as follows: 


function maxorder(m)
{
    { n, p[], k[] } := factorization(m) // m==product(i=0..n-1, p[i]**k[i])
    R := 1
    for i:=0 to n-1
    {
        t := euler_phi_pp(p[i], k[i]) // ==euler_phi(p[i]**k[i])
        if p[i]==2 AND k[i]>=3 then t := t / 2
        R := lcm(R, t)
    }
    return R
}
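The maximal order can be sketched in C++ along the lines of the pseudocode. The sketch is self-contained and only meant for small moduli: the prime powers are split off by trial division, and the helper names gcd_u64 and lcm_u64 are chosen here.

```cpp
#include <cstdint>

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while ( b )  { uint64_t t = a % b;  a = b;  b = t; }
    return a;
}

static uint64_t lcm_u64(uint64_t a, uint64_t b)  { return a / gcd_u64(a, b) * b; }

// Maximal order R(m) of an element of (Z/mZ)*, for m >= 1.
uint64_t maxorder(uint64_t m)
{
    uint64_t R = 1;
    for (uint64_t p=2; p*p<=m; ++p)
    {
        if ( m % p != 0 )  continue;
        unsigned k = 0;    // exponent of p in m
        uint64_t pk1 = 1;  // will be p**(k-1)
        while ( m % p == 0 )
        {
            m /= p;
            if ( k > 0 )  pk1 *= p;
            ++k;
        }
        uint64_t t = pk1 * (p - 1);           // euler_phi(p**k)
        if ( (p == 2) && (k >= 3) )  t /= 2;  // irregularity for powers of 2
        R = lcm_u64(R, t);
    }
    if ( m > 1 )  R = lcm_u64(R, m - 1);  // remaining prime factor, exponent 1
    return R;
}
```

For example, maxorder(15) is lcm(2, 4) = 4 and maxorder(16) is 4, half of φ(16) = 8.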


Now we can see for which moduli m the multiplicative group (Z/mZ)* will be cyclic:

    (Z/mZ)* is cyclic for m = 2, 4, p^k, 2·p^k where p is an odd prime        (39.7-13)

If the factorization of m contains two different odd primes p_a and p_b, then


    R(m) = \mathrm{lcm}( \ldots,\ \varphi(p_a^{k_a}),\ \ldots,\ \varphi(p_b^{k_b}),\ \ldots )


is smaller by at least a factor of 2 than


    \varphi(m) = \ldots \cdot \varphi(p_a^{k_a}) \cdot \ldots \cdot \varphi(p_b^{k_b}) \cdot \ldots


because both φ(p_a^{k_a}) and φ(p_b^{k_b}) are even. So (Z/mZ)* cannot be cyclic in that case. The same argument
holds for m = 2^{k_0} · p^k if k_0 > 1. For m = 2^k the group (Z/mZ)* is cyclic only for k = 1 and k = 2
because of the mentioned irregularity for powers of 2 (relation 39.7-11).
39.7.1.2 Computation of the order of an element 


Pseudocode for a function that returns the order of a given element x in (Z/mZ)*:


function order(x, m)
{
    if gcd(x,m)!=1 then return 0 // x not a unit
    h := euler_phi(m) // number of units
    e := h
    { n, p[], k[] } := factorization(h) // h==product(i=0..n-1, p[i]**k[i])
    for i:=0 to n-1
    {
        f := p[i]**k[i]
        e := e / f
        g1 := x**e mod m
        while g1!=1
        {
            g1 := g1**p[i] mod m
            e := e * p[i]
        }
    }
    return e
}
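The order computation can be sketched in C++ as below. The sketch follows the pseudocode step by step; the helpers (gcd_u64, factorization, euler_phi, pow_mod) are written here for self-containedness, use trial division, and assume m < 2^32 so that products fit into 64 bits.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

typedef uint64_t u64;
typedef std::vector< std::pair<u64, unsigned> > factors_t;  // (prime, exponent)

static u64 gcd_u64(u64 a, u64 b)
{
    while ( b )  { u64 t = a % b;  a = b;  b = t; }
    return a;
}

static factors_t factorization(u64 n)  // trial division
{
    factors_t f;
    for (u64 d=2; d*d<=n; ++d)
    {
        unsigned k = 0;
        while ( n % d == 0 )  { n /= d;  ++k; }
        if ( k )  f.push_back( std::make_pair(d, k) );
    }
    if ( n > 1 )  f.push_back( std::make_pair(n, 1u) );
    return f;
}

static u64 euler_phi(u64 m)
{
    u64 ph = 1;
    const factors_t f = factorization(m);
    for (const auto &pk : f)
    {
        ph *= (pk.first - 1);
        for (unsigned j=1; j<pk.second; ++j)  ph *= pk.first;
    }
    return ph;
}

static u64 pow_mod(u64 a, u64 e, u64 m)  // products fit for m < 2^32
{
    u64 r = 1 % m;
    a %= m;
    for ( ; e; e >>= 1)
    {
        if ( e & 1 )  r = (r * a) % m;
        a = (a * a) % m;
    }
    return r;
}

// Order of x in (Z/mZ)*, zero if x is not a unit.
u64 order(u64 x, u64 m)
{
    if ( gcd_u64(x, m) != 1 )  return 0;  // x not a unit
    const u64 h = euler_phi(m);           // number of units
    u64 e = h;
    const factors_t f = factorization(h);
    for (const auto &pk : f)
    {
        u64 pp = 1;
        for (unsigned j=0; j<pk.second; ++j)  pp *= pk.first;  // p**k
        e /= pp;                    // remove the prime power from e ...
        u64 g1 = pow_mod(x, e, m);
        while ( g1 != 1 )           // ... and put back as many factors p as needed
        {
            g1 = pow_mod(g1, pk.first, m);
            e *= pk.first;
        }
    }
    return e;
}
```

The results match the bracketed orders in figure 39.7-B, for example order(2, 15) is 4 and order(11, 15) is 2.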


Pseudocode for a function that returns an element x in (Z/mZ)* of maximal order:

function maxorder_element(m)
{
    R := maxorder(m)
    for x:=1 to m-1
    {
        if order(x, m)==R then return x
    }
    // never reached
}
Again, while the function does a simple search it is efficient in practice. For prime m the function returns
a primitive root. A C++ implementation is given in [FXT: mod/maxorder.cc]. Note that for noncyclic
groups the returned element does not necessarily have maximal order modulo all factors of the modulus.
We list all elements of (Z/15Z)* together with their orders modulo 15, 3, and 5:


     1:  r=1  r3=1  r5=1
     2:  r=4  r3=2  r5=4
     4:  r=2  r3=1  r5=2
     7:  r=4  r3=1  r5=4  <--=
     8:  r=4  r3=2  r5=4
    11:  r=2  r3=2  r5=1
    13:  r=4  r3=1  r5=4  <--=
    14:  r=2  r3=2  r5=2


The two elements marked with an arrow have maximal order modulo 15, but not modulo 3. An element of
maximal order modulo all factors of a composite modulus (equivalently, maximal order in all subgroups)
can be found by computing a generator for all cyclic subgroups and applying the Chinese remainder
algorithm given in section 39.4 on page 772.
39.7.2 Generators in cyclic groups 


Let G be the set of all generators in a cyclic group modulo n. Then the number of generators is given by


    |G| = \varphi(\varphi(n))                                                 (39.7-14)


Let g be a generator, then g^k is a generator if and only if gcd(k, φ(n)) = 1. There are φ(φ(n)) numbers
k that are coprime to φ(n).


Let g be a generator modulo a prime p. Then g is a generator modulo 2p^k for all k ≥ 1 if g is odd. If g
is even, then g + p^k is a generator modulo 2p^k.


Further, g is a generator modulo p^k if g^{p-1} mod p^2 ≠ 1. The only primes below 2^36 ≈ 68 · 10^9 for which
the smallest primitive root is not a generator modulo p^2 are 2, 40487 and 6692367337. Such primes are
called non-generous primes, see entry A055578 in [312].


The only known primes p below 32 · 10^12 where 2^{p-1} = 1 modulo p^2 are 1093 and 3511 (such primes are
called Wieferich primes, see entry A001220 in [312]). Now 2 is not a generator modulo either of the two.
Thus, whenever 2 is a generator modulo a prime p < 32 · 10^12, it is also a generator modulo p^k for all
k ≥ 1.
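The condition g^{p-1} mod p^2 ≠ 1 is easy to check by powering. The following C++ sketch does so for small primes; the function name is chosen here and the code assumes p < 2^16, so that p^2 < 2^32 and all products fit into 64 bits.

```cpp
#include <cstdint>

// a**e mod m by binary exponentiation; products fit for m < 2^32
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    for ( ; e; e >>= 1)
    {
        if ( e & 1 )  r = (r * a) % m;
        a = (a * a) % m;
    }
    return r;
}

// Return whether g**(p-1) == 1 modulo p**2, for a prime p < 2^16.
// If so, a generator g modulo p is not a generator modulo p**2.
bool power_is_one_mod_p2(uint64_t g, uint64_t p)
{
    return 1 == pow_mod(g, p - 1, p * p);
}
```

The function returns true for g = 2 with the two Wieferich primes 1093 and 3511, and for g = 5 with the non-generous prime 40487. It returns false for g = 3 and p = 7, where 3^6 = 729 = 43 modulo 49.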


39.7.3 Generators in noncyclic groups 
If the group is cyclic, an element of maximal order generates all invertible elements. With noncyclic
groups one needs more than one generator. GP’s function znstar() gives the complete information
about the multiplicative group of units. The help text reads:


znstar(n): 3-component vector v, giving the structure of (Z/nZ)^*. 
v[1] is the order (i.e. eulerphi(n)), 

v[2] is a vector of cyclic components, and 

v[3] is a vector giving the corresponding generators. 


Its output for 2 ≤ n ≤ 25 is shown in figure 39.7-C.
The group is cyclic if there is just one generator. In general, when znstar(n) returns

    [\varphi, [r_1, r_2, \ldots, r_k], [g_1, g_2, \ldots, g_k]]               (39.7-15)

then the φ invertible elements u are of the form

    u = g_1^{e_1} g_2^{e_2} \cdots g_k^{e_k}                                  (39.7-16)


where 0 ≤ e_i < r_i for 1 ≤ i ≤ k. For example, with n = 15:
? for(n=2,25,print(n," ",znstar(n)))
2 [1, [], []]   /* read: [1, [1], [Mod(1,2)]] */
3 [2, [2], [Mod(2, 3)]]
4 [2, [2], [Mod(3, 4)]]
5 [4, [4], [Mod(2, 5)]]
6 [2, [2], [Mod(5, 6)]]
7 [6, [6], [Mod(3, 7)]]
8 [4, [2, 2], [Mod(5, 8), Mod(3, 8)]]
9 [6, [6], [Mod(2, 9)]]
10 [4, [4], [Mod(7, 10)]]
11 [10, [10], [Mod(2, 11)]]
12 [4, [2, 2], [Mod(7, 12), Mod(5, 12)]]
13 [12, [12], [Mod(2, 13)]]
14 [6, [6], [Mod(3, 14)]]
15 [8, [4, 2], [Mod(8, 15), Mod(11, 15)]]
16 [8, [4, 2], [Mod(5, 16), Mod(7, 16)]]
17 [16, [16], [Mod(3, 17)]]
18 [6, [6], [Mod(11, 18)]]
19 [18, [18], [Mod(2, 19)]]
20 [8, [4, 2], [Mod(3, 20), Mod(11, 20)]]
21 [12, [6, 2], [Mod(5, 21), Mod(8, 21)]]
22 [10, [10], [Mod(13, 22)]]
23 [22, [22], [Mod(5, 23)]]
24 [8, [2, 2, 2], [Mod(13, 24), Mod(19, 24), Mod(17, 24)]]
25 [20, [20], [Mod(2, 25)]]


Figure 39.7-C: Structure of the multiplicative groups modulo n for 2 ≤ n ≤ 25.


? znstar(15)
[8, [4, 2], [Mod(8, 15), Mod(11, 15)]]
? g1=Mod(8,15); g2=Mod(11,15);
? for(e1=0,4-1,for(e2=0,2-1,print(e1," ",e2," ",g1^e1*g2^e2)))
0 0 Mod(1, 15)
0 1 Mod(11, 15)
1 0 Mod(8, 15)
1 1 Mod(13, 15)
2 0 Mod(4, 15)
2 1 Mod(14, 15)
3 0 Mod(2, 15)
3 1 Mod(7, 15)


The multiplicative group modulo n = 2^k is cyclic only for k ≤ 2:


? for(i=1,6,print(i,": ",znstar(2^i)))
1: [1, [], []]
2: [2, [2], [Mod(3, 4)]]
3: [4, [2, 2], [Mod(5, 8), Mod(3, 8)]]
4: [8, [4, 2], [Mod(5, 16), Mod(7, 16)]]
5: [16, [8, 2], [Mod(5, 32), Mod(15, 32)]]
6: [32, [16, 2], [Mod(5, 64), Mod(31, 64)]]


For k ≥ 3 the multiplicative group is generated by the two elements 5 and -1.
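That 5 and -1 together generate all units modulo 2^k can be checked by brute force. The C++ sketch below (function name chosen here, intended for small k such as 3 ≤ k ≤ 20) marks all residues of the form ±5^e and then tests whether every odd residue was reached.

```cpp
#include <cstdint>
#include <vector>

// Return whether every odd residue modulo 2**k is of the form (+-1) * 5**e.
// Intended for small k, say 3 <= k <= 20.
bool generated_by_5_and_minus_1(unsigned k)
{
    const uint64_t m = uint64_t(1) << k;
    std::vector<bool> seen(m, false);
    uint64_t g = 1;  // 5**e mod m
    for (uint64_t e=0; e<m/4; ++e)  // the order of 5 is 2**(k-2)
    {
        seen[g] = true;      // +5**e
        seen[m - g] = true;  // -5**e
        g = (g * 5) % m;
    }
    for (uint64_t u=1; u<m; u+=2)  // the units are the odd residues
        if ( !seen[u] )  return false;
    return true;
}
```

For k = 3 the marked residues are 1, 7 (from e = 0) and 5, 3 (from e = 1), which are all four units modulo 8.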


39.7.4 Inversion by exponentiation 


For a unit u of order r = ord(u) one has u^r = 1. As r divides the maximal order R also u^R = 1 holds
and so u^{R-1} · u = 1. That is, the inverse of any invertible element u equals u to the (R - 1)-st power:


    u^{-1} = u^{R-1}                                                          (39.7-17)


In fact, one has also u^{-1} = u^{φ(m)-1} which may involve slightly more work if the group is noncyclic.
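Inversion by exponentiation is a one-liner once a powering routine is available. The following C++ sketch (names chosen here, modulus assumed below 2^32) takes the exponent R as a parameter, so either the maximal order or φ(m) can be supplied.

```cpp
#include <cstdint>

// a**e mod m by binary exponentiation; products fit for m < 2^32
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    for ( ; e; e >>= 1)
    {
        if ( e & 1 )  r = (r * a) % m;
        a = (a * a) % m;
    }
    return r;
}

// Inverse of the unit u modulo m, where R is the maximal order R(m)
// or any other multiple of the order of u, such as euler_phi(m).
uint64_t inv_by_pow(uint64_t u, uint64_t R, uint64_t m)
{
    return pow_mod(u, R - 1, m);
}
```

Modulo 15 the maximal order is 4, so the inverse of 2 is 2^3 = 8 and the inverse of 7 is 7^3 = 13 (indeed 7 · 13 = 91 = 6 · 15 + 1). Using φ(15) = 8 instead gives the same results with a longer exponent.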


39.8 Quadratic residues 


Let p be a prime. The quadratic residues modulo p are those values a so that the equation


    x^2 \equiv a \pmod{p}                                                     (39.8-1)

has a solution. If the equation has no solution, then a is called a quadratic non-residue modulo p or
simply a non-residue. A quadratic residue is a square (modulo p) of some number, so we can safely just
call it a square modulo p. Another short form is simply residue.


Let g be a primitive root (the particular choice does not matter), then every nonzero element x can
uniquely be written as x = g^e where 0 < e < p. Rewriting equation 39.8-1 as x^2 = (g^e)^2 = g^{2e} = a makes
it apparent that the quadratic residues are the even powers of g. The non-residues are the odd powers of
g. All generators are non-residues: g = g^1.


Let us compute f(x) := x^{(p-1)/2} for both residues and non-residues: With a quadratic residue g^{2e} we
get f(g^{2e}) = g^{2e(p-1)/2} = 1^e = 1 where we used g^{p-1} = 1. With a non-residue a = g^k, k odd, we get
f(a) = f(g^k) = g^{k(p-1)/2} = -1 where we used g^{(p-1)/2} = -1 (the only square root of 1 apart from 1 is
-1) and (-1)^k = -1 for k odd.


Apparently we just found a function that can tell residues from non-residues. In fact, we rediscovered
the Legendre symbol usually written as (a/p). A surprising property of the Legendre symbol is the law
of quadratic reciprocity: Let p and q be distinct odd primes, then


    \left(\frac{p}{q}\right) = (-1)^{\frac{p-1}{2}\,\frac{q-1}{2}} \left(\frac{q}{p}\right)      (39.8-2)


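Also the following relations hold:


    \left(\frac{-1}{p}\right) = (-1)^{(p-1)/2} = \begin{cases} +1 & \text{if } p \equiv 1 \pmod 4 \\ -1 & \text{if } p \equiv 3 \pmod 4 \end{cases}      (39.8-3a)
    \left(\frac{2}{p}\right) = (-1)^{(p^2-1)/8} = \begin{cases} +1 & \text{if } p \equiv \pm 1 \pmod 8 \\ -1 & \text{if } p \equiv \pm 3 \pmod 8 \end{cases}      (39.8-3b)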
    \left(\frac{\ldots}{p}\right) = \ldots                                    (39.8-3c)
    \left(\frac{\ldots}{p}\right) = 1 \ \ldots\ p = 2,\ p = 3,\ \text{or } p \equiv 1 \bmod 3      (39.8-3d)


If a is a square modulo p, then the polynomial x^2 - a (with coefficients modulo p) factors as (x - r_1)(x - r_2)
where r_1^2 = a and r_2^2 = a. The number -1 is a square modulo 41 = 4 · 10 + 1 and we have x^2 + 1 =
(x - 9)(x - 32). The polynomial x^2 + 1 with coefficients modulo 43 = 4 · 10 + 3 is irreducible, -1 is not
a square modulo 43.
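The function f(x) = x^{(p-1)/2} from above gives a direct, if not the fastest, way to compute the Legendre symbol. A minimal C++ sketch, with names chosen here and assuming an odd prime p < 2^32:

```cpp
#include <cstdint>

// a**e mod m by binary exponentiation; products fit for m < 2^32
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    for ( ; e; e >>= 1)
    {
        if ( e & 1 )  r = (r * a) % m;
        a = (a * a) % m;
    }
    return r;
}

// Legendre symbol (a/p) for an odd prime p, computed as a**((p-1)/2) mod p.
int legendre_by_pow(uint64_t a, uint64_t p)
{
    const uint64_t f = pow_mod(a, (p - 1) / 2, p);
    if ( f == 0 )  return 0;      // p divides a
    return ( f == 1 ? +1 : -1 );  // otherwise f is either 1 or p-1
}
```

With the two examples just given: legendre_by_pow(40, 41) is +1 because 40 = -1 is a square modulo 41 (9^2 = 81 = 2 · 41 - 1), and legendre_by_pow(42, 43) is -1.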


The relation between the Legendre symbols of positive and negative arguments is


    \left(\frac{-a}{p}\right) = (-1)^{(p-1)/2} \left(\frac{a}{p}\right) = \begin{cases} +\left(\frac{a}{p}\right) & \text{if } p = 4k+1 \\ -\left(\frac{a}{p}\right) & \text{if } p = 4k+3 \end{cases}      (39.8-4)

Modulo a prime p = 4k + 3, if +a is a square, then -a is not a square. The orders of any two elements
+a and -a differ by a factor of 2. Non-residues can easily be found: -(b^2) is a non-residue for all nonzero b.


Modulo a prime p = 4k + 1, if +a is a square, then -a is also a square. The orders of two non-residues
+a and -a are identical. The orders of two residues +a and -a can be identical or differ by a factor of
2.


A special case are primes of the form p = 2^x + 1, the Fermat primes. Only five Fermat primes are known
today: 2^1 + 1 = 3, 2^2 + 1 = 5, 2^4 + 1 = 17, 2^8 + 1 = 257 and 2^16 + 1 = 65537. To be prime the exponent
x must be a power of 2. The primitive roots are exactly the non-residues: the maximal order equals
R = φ(p) = 2^x. There are φ(φ(p)) = 2^{x-1} primitive roots. There are (p - 1)/2 = 2^{x-1} squares which all
have order at most R/2. The remaining 2^{x-1} non-residues must all be primitive roots.


We will not pursue the issue, but it should be noted that there are more efficient ways than powering
to determine the Legendre symbol. A generalization of the Legendre symbol for composite moduli is
the Kronecker symbol. An efficient implementation for its computation (following [110, p.29]) is given in
[FXT: mod/kronecker.cc]:
  b\a   0000000000111111111122222222223333333
        0123456789012345678901234567890123456
   0:   0+00000000000000000000000000000000000
   1:   +++++++++++++++++++++++++++++++++++++
   2:   0+0-0-0+0+0-0-0+0+0-0-0+0+0-0-0+0+0-0
   3:   0+-0+-0+-0+-0+-0+-0+-0+-0+-0+-0+-0+-0
   4:   0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0
   5:   0+--+0+--+0+--+0+--+0+--+0+--+0+--+0+
   6:   0+000+0+000+0-000-0-000-0+000+0+000+0
   7:   0++-+--0++-+--0++-+--0++-+--0++-+--0+
   8:   0+0-0-0+0+0-0-0+0+0-0-0+0+0-0-0+0+0-0
   9:   0++0++0++0++0++0++0++0++0++0++0++0++0
  10:   0+0+000-0+0-0+000-0-0-0-000+0-0+0-000
  11:   0+-+++---+-0+-+++---+-0+-+++---+-0+-+
  12:   0+000-0+000-0+000-0+000-0+000-0+000-0
  13:   0+-++----++-+0+-++----++-+0+-++----++
  14:   0+0+0+000+0-0+0+0-0+000+0+0+0-0-0-000
  15:   0++0+00-+00-0--0++0+00-+00-0--0++0+00
  16:   0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0
  17:   0++-+---++---+-++0++-+---++---+-++0++
  18:   0+000-0+000-0-000+0-000+0+000-0+000-0
  19:   0+--++++-+-+----++-0+--++++-+-+----++
  20:   0+0-000-0+0+0-000-0+0+0-000-0+0+0-000


Figure 39.8-A: Kronecker symbols (a/b) for small positive a and b. The columns correspond to a = 0, 1, ..., 36 (the two header lines give the tens and the units digit of a).


int
kronecker(umod_t a, umod_t b)
// Return Kronecker symbol (a/b).
// Equal to Legendre symbol (a/b) if b is an odd prime.
{
    static const int tab2[] = {0, 1, 0, -1, 0, -1, 0, 1};
    // tab2[ a & 7 ] := (-1)^((a^2-1)/8)

    if ( 0==b )  return (1==a);
    if ( 0==((a|b)&1) )  return 0;  // a and b both even ?

    int v = 0;
    while ( 0==(b&1) )  { ++v;  b>>=1; }

    int k;
    if ( 0==(v&1) )  k = 1;
    else             k = tab2[ a & 7 ];

    while ( 1 )
    {
        if ( 0==a )  return ( b>1 ? 0 : k );
        v = 0;
        while ( 0==(a&1) )  { ++v;  a>>=1; }
        if ( 1==(v&1) )  k *= tab2[ b & 7 ];  // k *= (-1)**((b*b-1)/8)
        if ( a & b & 2 )  k = -k;  // k = k*(-1)**((a-1)*(b-1)/4)
        umod_t r = a;  // signed: r = abs(a)
        a = b % r;
        b = r;
    }
}


A table of Kronecker symbols for small a and b is shown in figure 39.8-A. It was created with the
program [FXT: mod/kronecker-demo.cc].


The following relations hold for the Kronecker symbol:

    \left(\frac{ab}{n}\right) = \left(\frac{a}{n}\right) \left(\frac{b}{n}\right)      (39.8-5a)
    \left(\frac{a}{mn}\right) = \left(\frac{a}{m}\right) \left(\frac{a}{n}\right)      (39.8-5b)


Note we may have (a/mn) = +1 while a is not a square modulo mn: If (a/m) = (a/n) = -1 (a is a non-square
modulo both m and n), then (by relation 39.8-5b) (a/mn) = +1. But a is not a square mod mn, as a
square mod mn must be a square both mod m and mod n. For example, (2/143) = +1 but 2 is not a square
modulo 143 = 11 · 13, we have (2/11) = -1 and (2/13) = -1, so 2 is a non-square modulo both primes and
so modulo their product.


For a square b = a^2 the Kronecker symbol will always be +1: (b/n) = (a/n) · (a/n) = +1 (by relation 39.8-5a).
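The example with the modulus 143 can be verified by brute force. The C++ sketch below is only meant for small moduli; both function names are chosen here. It determines squares by trying all candidates, so the symbols modulo 11 and 13 are obtained without any reciprocity machinery.

```cpp
#include <cstdint>

// Return whether a is a square modulo m (brute force, small m only).
bool is_square_mod(uint64_t a, uint64_t m)
{
    a %= m;
    for (uint64_t x=0; x<m; ++x)
        if ( (x * x) % m == a )  return true;
    return false;
}

// Legendre symbol (a/p) for an odd prime p, by brute force.
int legendre_brute(uint64_t a, uint64_t p)
{
    if ( a % p == 0 )  return 0;
    return ( is_square_mod(a, p) ? +1 : -1 );
}
```

Here legendre_brute(2, 11) and legendre_brute(2, 13) are both -1, so their product (the symbol modulo 143 by relation 39.8-5b) is +1, while is_square_mod(2, 143) is false.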


Whether a given number a is a square modulo 2^x can be determined via the simple routine [FXT:
mod/quadresidue.cc]:


bool is_quadratic_residue_2ex(umod_t a, ulong x)
// Return whether a is quadratic residue mod 2**x
{
    if ( x==1 )  return true;
    if ( (x>=3) && (1==(a&7)) )  return true;
    if ( (x==2) && (1==(a&3)) )  return true;
    return false;
}


A curious observation regarding quadratic residues is that exactly for the 29 moduli 


2, 3, 4, 5, 8, 12, 15, 16, 24, 28, 40, 48, 56, 60, 72, 88, 112, 120, 
168, 232, 240, 280, 312, 408, 520, 760, 840, 1320, 1848 


all quadratic residues are non-prime. This sequence is entry A065428 in [312]. It can be generated using
the program [FXT: mod/mod-residues-demo.cc].


See any textbook on number theory for the details of the theory of quadratic residues and [110], [221],
and [323] for the corresponding algorithms. A method for watermarking that uses quadratic
residues is discussed in [25]. An algorithm to compute conference matrices via quadratic residues is given
elsewhere in this book.
39.9 Computation of a square root modulo m 
We give algorithms for computing square roots modulo primes, prime powers, and composites. 


39.9.1 Square roots modulo a prime 


The square roots of a square a modulo a prime p = 4k + 3 can be computed as


    \sqrt{a} = \pm a^{(p+1)/4}                                                (39.9-1)


Observe that (a^{(p+1)/4})^2 = a^{(p+1)/2} = a^{(p-1)/2+1} = ±1 · a = ±a. If a is not a square, then a square root
of -a is obtained. Similar expressions for square roots modulo p are developed in [3]. An algorithm for
the computation of a square root modulo a prime p (without restriction on the form of p) is given in [110,
p.32]. We just give a C++ implementation [FXT: mod/sqrtmod.cc]:


umod_t
sqrt_modp(umod_t a, umod_t p)
// Return x such that x*x==a (mod p)
// p must be an odd prime.
// If a is not a square mod p then return 0.
{
    if ( 1!=kronecker(a,p) )  return 0;  // not a square mod p

    // initialize q,t so that p==q*2^t+1
    umod_t q;  int t;
    n2qt(p, q, t);

    umod_t z = 0, n = 0;
    for (n=1; n<p; ++n)
    {
        if ( -1==kronecker(n, p) )
        {
            z = pow_mod(n, q, p);
            break;
        }
    }

    if ( n>=p )  return 0;

    umod_t y = z;
    uint r = t;
    umod_t x = pow_mod(a, (q-1)/2, p);
    umod_t b = x;
    x = mul_mod(x, a, p);
    b = mul_mod(b, x, p);

    while ( 1 )
    {
        if ( 1==b )  return x;

        uint m;
        for (m=1; m<r; ++m)
        {
            if ( 1==pow_mod(b, 1ULL<<m, p) )  break;
        }

        if ( m==r )  return 0;  // a is not a square mod p

        umod_t v = pow_mod(y, 1ULL<<(r-m-1), p);
        y = sqr_mod(v, p);
        r = m;
        x = mul_mod(x, v, p);
        b = mul_mod(b, y, p);
    }
}
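For primes of the form p = 4k + 3 the closed formula 39.9-1 needs just one exponentiation. A minimal C++ sketch of this special case follows; the function names are chosen here, it is not part of the FXT code shown above, and it assumes p < 2^32.

```cpp
#include <cstdint>

// a**e mod m by binary exponentiation; products fit for m < 2^32
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    for ( ; e; e >>= 1)
    {
        if ( e & 1 )  r = (r * a) % m;
        a = (a * a) % m;
    }
    return r;
}

// Square root of a modulo a prime p = 4k+3 via x = a**((p+1)/4).
// Returns 0 if a is not a square modulo p (and also for a == 0).
uint64_t sqrt_mod_p3(uint64_t a, uint64_t p)
{
    a %= p;
    const uint64_t x = pow_mod(a, (p + 1) / 4, p);
    // for a non-residue a we have x*x == -a, so check the result
    return ( (x * x) % p == a ? x : 0 );
}
```

For example, sqrt_mod_p3(2, 23) gives 2^6 mod 23 = 18 and indeed 18^2 = 324 = 14 · 23 + 2. The second square root is p - x.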


39.9.2 Square roots modulo a prime power 


For the computation of a square root modulo a prime power p^x the Newton iteration can be used (see
section 29.1.5 on page 569). The case p = 2 has to be treated separately [FXT: mod/sqrtmod.cc]:


umod_t
sqrt_modpp(umod_t a, umod_t p, long ex)
// Return r so that r^2 == a (mod p^ex)
// return 0 if there is no such r
{
    umod_t r;

    if ( 2==p )  // case p==2
    {
        if ( false==is_quadratic_residue_2ex(a, ex) )  return 0;  // no sqrt exists
        else  r = 1;  // (1/r)^2 = a mod 2
    }
    else  // case p odd
    {
        umod_t z = a % p;
        r = sqrt_modp(z, p);
        if ( r==0 )  return 0;  // no sqrt exists
    }
    // here r^2 == a (mod p)

    if ( 1==ex )  return r;


Here r is a square root of a modulo p, Newton steps are used to compute the square root of a modulo powers of p:


    const umod_t m = ipow(p, ex);

    if ( 2==p )  // case p==2
    {
        long x = 1;
        while ( x<ex )  // Newton iteration for inverse sqrt, 2-adic case
        {
            umod_t z = a;
            z = mul_mod(z, r, m);  // a*r
            z = mul_mod(z, r, m);  // a*r*r
            z = sub_mod(3, z, m);  // 3 - a*r*r
            r = mul_mod(r, z/2, m);  // r*(3 - a*r*r)/2 = r*(1 + (1-a*r*r)/2)
            x *= 2;  // (1/r)^2 == a mod 2^x
        }
        r = mul_mod(r, a, m);
    }
    else  // case p odd
    {
        const umod_t h = inv_modpp(2, p, ex);  // 1/2
        long x = 1;
        while ( x<ex )  // Newton iteration for square root
        {
            umod_t ri = inv_modpp(r, p, ex);  // 1/r
            umod_t ar = mul_mod(a, ri, m);  // a/r
            r = add_mod(r, ar, m);  // r+a/r
            r = mul_mod(r, h, m);  // (r+a/r)/2
            x *= 2;  // r^2 == a mod p^x
        }
    }

    return r;
}


39.9.3 Square roots modulo an arbitrary number 


Square roots modulo an arbitrary number can be computed from the square roots of its prime power
factors using the Chinese remainder theorem (see section 39.4 on page 772) [FXT: mod/sqrtmod.cc]:


umod_t
sqrt_modf(umod_t a, const factorization &mf)
// Return sqrt(a) mod m, given the factorization mf of m
{
    ALLOCA(umod_t, x, mf.nprimes() );
    for (int i=0; i<mf.nprimes(); ++i)
    {
        // x[i]=sqrt(a) modulo i-th prime power:
        x[i] = sqrt_modpp( a, mf.prime(i), mf.exponent(i) );
        if ( x[i]==0 )  return 0;  // no sqrt exists
    }
    return chinese(x, mf);  // combine via CRT
}


39.10 The Rabin-Miller test for compositeness 


We describe a probabilistic method to prove compositeness of an integer. 


39.10.1  Pseudoprimes and strong pseudoprimes 


For a prime p the maximal order of an element equals p - 1. That is, for all a ≠ 0

    a^{p-1} \equiv 1 \bmod p                                                  (39.10-1)


If for a given number n one finds an a > 1 so that a^{n-1} ≠ 1 mod n, then the compositeness of n has
been proved. Composite numbers n for which a^{n-1} = 1 mod n are called pseudoprime to base a (or
a-pseudoprime). For example, for n = 15 we find


       a:  2  3  4  5  6  7  8  9 10 11 12 13 14
    a^14:  4  9  1 10  6  4  4  6 10  1  9  4  1


We found that 15 is pseudoprime to the bases 4, 11 and 14 which we also could have read off the rightmost
column of figure 39.7-B.


The bad news is that some composite numbers are pseudoprime to very many bases. The smallest such
number is 561 which is pseudoprime to all bases a with gcd(a, n) = 1. Numbers with this property
are called Carmichael numbers. The first few are 561, 1105, 1729, 2465, 2821, 6601, 8911, ..., this is
sequence A002997 in [312]. There are infinitely many Carmichael numbers as proved in [7]. Finding a
base that proves a Carmichael number composite is as difficult as finding a factor.
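The pseudoprime condition is a single modular exponentiation. The C++ sketch below (function names chosen here, n < 2^32 assumed) tests one base and counts the bases 2 ≤ a ≤ n - 2 that fail to prove a given n composite.

```cpp
#include <cstdint>

// a**e mod m by binary exponentiation; products fit for m < 2^32
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    for ( ; e; e >>= 1)
    {
        if ( e & 1 )  r = (r * a) % m;
        a = (a * a) % m;
    }
    return r;
}

// Return whether a**(n-1) == 1 modulo n.
// True for every base 0 < a < n if n is prime; if n is composite
// and the result is true, then n is pseudoprime to base a.
bool fermat_passes(uint64_t n, uint64_t a)
{
    return 1 == pow_mod(a, n - 1, n);
}

// Number of bases a with 2 <= a <= n-2 for which fermat_passes(n, a).
uint64_t count_passing_bases(uint64_t n)
{
    uint64_t ct = 0;
    for (uint64_t a=2; a+2<=n; ++a)  ct += fermat_passes(n, a);
    return ct;
}
```

For n = 15 only the bases 4 and 11 pass in this range. For the Carmichael number 561 = 3 · 11 · 17 all 318 bases in the range that are coprime to 561 pass, only bases sharing a factor with 561 (such as 3) reveal compositeness.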


A significantly better algorithm can be found by a rather simple modification. Write n - 1 =: q · 2^t
where q is odd, we examine the sequence b := a^q, b^2, b^4, ..., b^{2^{t-1}} = a^{(n-1)/2}. We say that n is a strong
pseudoprime to base a if either b = 1 or b^{2^e} = -1 ≡ n - 1 for some e where 0 ≤ e < t. We abbreviate
strong pseudoprime as SPP. If neither of the conditions holds, then n is proved composite. Then n is
either not a pseudoprime to base a or we found a square root of 1 that is not equal to n - 1.
With two different square roots s_1, s_2 modulo n of a number z (here z = 1) we have


    s_1^2 - z \equiv 0 \bmod n                                                (39.10-2a)
    s_2^2 - z \equiv 0 \bmod n                                                (39.10-2b)
    s_1^2 - s_2^2 = (s_1 + s_2)(s_1 - s_2) \equiv 0 \bmod n                   (39.10-2c)


So both s_1 + s_2 and s_1 - s_2 have a nontrivial factor in common with n if s_1 ≠ n - s_2. Thus a square
root s ≠ ±1 of 1 proves compositeness because both gcd(s + 1, n) and gcd(s - 1, n) are nontrivial factors of n.


Let B = [b, b^2, b^4, ..., b^{2^t}], then for n prime the sequence B must have one of the following forms: either

    B = [1, 1, 1, \ldots, 1]   or                                             (39.10-3a)
    B = [*, \ldots, *, -1, 1, \ldots, 1]                                      (39.10-3b)


where an asterisk denotes any number not equal to ±1 mod n (notation as in [221]). For n composite the
sequence B can also be of the form


    B = [*, \ldots, *]   (a^{n-1} \ne 1, not a pseudoprime to base a)   or      (39.10-4a)
    B = [*, \ldots, *, 1, \ldots, 1]   (found square root of 1 not equal to -1)  (39.10-4b)


If one of the latter two forms is encountered, then n must be composite.


With our example n = 15 we have n - 1 = 7 · 2^1, thereby q = 7 and t = 1. We only have to examine the
value of b. Values of a for which b is not equal to either +1 or -1 prove the compositeness of 15.


    a:  2  3  4  5  6  7  8  9 10 11 12 13 14
    b:  8 12  4  5  6 13  2  9 10 11  3  7 -1


In our example all bases ≠ 14 prove 15 composite. As n is always an SPP to base a = n - 1 ≡ -1, we
restrict our attention to values 2 ≤ a ≤ n - 2.


A GP implementation of the test whether n is an SPP to base a: 


sppq(n, a)=
{ /* Return whether n is a strong pseudoprime to base a */
    local(q, t, b, e);
    q = n-1;
    t = 0;
    while ( 0==bitand(q,1), q/=2; t+=1 );
    /* here n==2^t*q+1 */

    b = Mod(a, n)^q;
    if ( 1==b, return(1) );
    e = 1;
    while ( e<t,
        if( (b==1) || (b==n-1), break(); );
        b *= b;
        e++;
    );
    return( if ( b!=(n-1), 0, 1 ) );
}


The Carmichael number 561 (561 - 1 = 35 · 2^4, so q = 35 and t = 4) is an SPP to only 8 out of
the 558 interesting bases and not an SPP for any 2 ≤ a ≤ 20, as shown in figure 39.10-A. Note that
with a = 4 we found s = 67 where s^2 = 1 mod 561 and thereby the factors gcd(67 + 1, 561) = 17 and
gcd(67 - 1, 561) = 33 of 561.

39.10.2 The Rabin-Miller test


The Rabin-Miller test is an algorithm to prove compositeness of a number n by testing strong
pseudoprimality with several bases:
 a=2:   b=263  b^2=166  b^4= 67
 a=3:   b= 78  b^2=474  b^4=276
 a=4:   b=166  b^2= 67  b^4=  1        all SPP bases:
 a=5:   b= 23  b^2=529  b^4=463          a=50:   b=560
 a=6:   b=318  b^2=144  b^4=540          a=101:  b=560
 a=7:   b=241  b^2=298  b^4=166          a=103:  b=  1
 a=8:   b=461  b^2=463  b^4= 67          a=256:  b=  1
 a=9:   b=474  b^2=276  b^4=441          a=305:  b=560
 a=10:  b=439  b^2=298  b^4=166          a=458:  b=560
 a=11:  b=209  b^2=484  b^4=319          a=460:  b=  1
 a=12:  b= 45  b^2=342  b^4=276          a=511:  b=  1
 a=13:  b=208  b^2= 67  b^4=  1
 a=14:  b=551  b^2=100  b^4=463
 a=15:  b=111  b^2=540  b^4=441
 a=16:  b= 67  b^2=  1
 a=17:  b=527  b^2= 34  b^4= 34
 a=18:  b=120  b^2=375  b^4=375
 a=19:  b= 76  b^2=166  b^4= 67
 a=20:  b=452  b^2=100  b^4=463


Figure 39.10-A: The Carmichael number 561 = 35 · 2^4 + 1 is a strong pseudoprime to 8 out of 558 bases
a (right) and no base 2 ≤ a ≤ 20 (left).


rm(n, na=20)=
{ /* Rabin Miller test */
    local(a);
    for (a=2, na+2,
        if ( a>n-2, break() );
        if ( 0==sppq(n, a), return(0) ); /* proven composite */
    );
    return(1); /* composite with probability less than 0.25^na */
}


For a composite number the probability of being a SPP to a ‘random’ base is at most 1/4. So the 
compositeness of a number can quickly be proved in practice. While the algorithm does not prove 
primality, it can be used to rule out compositeness with a very high probability. 


Bases tested: 2 3 5 6 7 10 11 12 13 14 15 17 


91: [3] 10 12 17 
133: [2] 11 12 

145: [2] 12 17 
276: [2] 11 13 

286: [2] 3 17 
703: [2] 3 7 

742: [2] 15 17 
781: [2] 5 17 
946: [2] 7 15 
1111: [2] 6 17 
1729: [2] 10 12 
2047: [2] 2 11 
2806: [2] 5 13 
2821: [2] 12 17 
3277: [3] 2 14 15 
4033: [2] 2 17 
4187: [2] 10 17 
5662: [2] 5 17 
5713: [2] 6 14 
6533: [2] 6 10 
6541: [2] 14 15 
7171: [2] 14 17 
8401: [2] 3 10 
8911: [3] 3 12 13 
9073: [2] 12 14 


Figure 39.10-B: All numbers < 10,000 that are strong pseudoprimes to more than one base a < 17 
(omitting bases a that are prime powers). 


A list (created with the program [FXT: mod/rabinmiller-demo.cc]) of composites n < 10,000 that are
SPP to more than one base a ≤ 17 is shown in figure 39.10-B. The table indicates how effective the Rabin-
Miller algorithm actually is: it does not contain a single number pseudoprime to both 2 and 3. The first
few odd composite numbers that are SPP to both bases a = 2 and a = 3 are shown in figure 39.10-C.


There are 104 such composite n < 2^32, given in [FXT: data/pseudo-spp23.txt]. This sequence of numbers
is entry A072276 in [312], entry A001262 gives the base-2 SPPs, and entry A020229 gives the base-3
 1,373,653 == 1 + 2^2 * 3^3 * 7 * 23 * 79
           == 829 * 1657 == (1 + 2^2 * 3^2 * 23) * (1 + 2^3 * 3^2 * 23)
 1,530,787 == 1 + 2 * 3 * 103 * 2477
           == 619 * 2473 == (1 + 2 * 3 * 103) * (1 + 2^3 * 3 * 103)
 1,987,021 == 1 + 2^2 * 3^2 * 5 * 7 * 19 * 83
           == 997 * 1993 == (1 + 2^2 * 3 * 83) * (1 + 2^3 * 3 * 83)
 2,284,453 == 1 + 2^2 * 3^2 * 23 * 31 * 89
           == 1069 * 2137 == (1 + 2^2 * 3 * 89) * (1 + 2^3 * 3 * 89)
 3,116,107 == 1 + 2 * 3^2 * 7^2 * 3533
           == 883 * 3529 == (1 + 2 * 3^2 * 7^2) * (1 + 2^3 * 3^2 * 7^2)
 5,173,601 == 1 + 2^5 * 5^2 * 29 * 223
           == 929 * 5569 == (1 + 2^5 * 29) * (1 + 2^6 * 3 * 29)
 6,787,327 == 1 + 2 * 3 * 7 * 13 * 31 * 401
           == 1303 * 5209 == (1 + 2 * 3 * 7 * 31) * (1 + 2^3 * 3 * 7 * 31)
11,541,307 == 1 + 2 * 3 * 7 * 283 * 971
           == 1699 * 6793 == (1 + 2 * 3 * 283) * (1 + 2^3 * 3 * 283)
13,694,761 == 1 + 2^3 * 3^2 * 5 * 109 * 349
           == 2617 * 5233 == (1 + 2^3 * 3 * 109) * (1 + 2^4 * 3 * 109)


Figure 39.10-C: The first composite numbers that are SPP to both bases 2 and 3.


   25,326,001 == 1 + 2^4 * 3^3 * 5^3 * 7 * 67
              == 2251 * 11251 == (1 + 2 * 3^2 * 5^3) * (1 + 2 * 3^2 * 5^4)
  161,304,001 == 1 + 2^6 * 3 * 5^3 * 11 * 13 * 47
              == 7333 * 21997 == (1 + 2^2 * 3 * 13 * 47) * (1 + 2^2 * 3^2 * 13 * 47)
  960,946,321 == 1 + 2^4 * 3 * 5 * 29 * 101 * 1367
              == 11717 * 82013 == (1 + 2^2 * 29 * 101) * (1 + 2^2 * 7 * 29 * 101)
1,157,839,381 == 1 + 2^2 * 3^3 * 5 * 401 * 5347
              == 24061 * 48121 == (1 + 2^2 * 3 * 5 * 401) * (1 + 2^3 * 3 * 5 * 401)
3,215,031,751 == 1 + 2 * 3^4 * 5^3 * 7 * 37 * 613
              == 151 * 751 * 28351
              == (1 + 2 * 3 * 5^2) * (1 + 2 * 3 * 5^3) * (1 + 2 * 3^4 * 5^2 * 7)
3,697,278,427 == 1 + 2 * 3^3 * 31 * 563 * 3923
              == 30403 * 121609 == (1 + 2 * 3^3 * 563) * (1 + 2^3 * 3^3 * 563)


Figure 39.10-D: All composite numbers n < 2^32 that are SPP to the three bases 2, 3 and 5.


SPPs. We note the uneven distribution modulo 12:

    (n%12: num)   (1: 75)   (5: 9)   (7: 18)   (11: 2)


Composites that are SPP to the three bases 2, 3 and 5 are quite rare, figure 39.10-D shows all 6 such
composite numbers < 2^32 (values taken from [272] which lists all such numbers < 25 · 10^9). Thus we can
speed up the Rabin-Miller test for small values of n (say, n < 2^32) by only testing the bases a = 2, 3, 5
and, if n is an SPP to these bases, looking up the composites in the table. The smallest odd composites
that are SPP to the first k prime bases up to k = 8 are determined in [191], they are given as sequence
A006945 in [313].


    composite          SPP to base

    2047               2
    1373653            2, 3
    25326001           2, 3, 5
    3215031751         2, 3, 5, 7
    2152302898747      2, 3, 5, 7, 11
    3474749660383      2, 3, 5, 7, 11, 13
    341550071728321    2, 3, 5, 7, 11, 13, 17 [and 19]
    341550071728321    2, 3, 5, 7, 11, 13, 17, 19

Note that if the probability of a base not proving compositeness was exactly 1/4 we would find many
more entries in figure [39.10-D]. Slightly overestimating the number of composites below N as N, there
should be about (1/4)^3 N = N/64 entries, that is 2^26 ≈ 6.7 · 10^7 for N = 2^32, but we have only six
entries. So the Rabin-Miller test is in practice significantly more efficient than one may initially
assume. Let p_{k,t} be the probability that a k-bit composite 'survives' t passes of the Rabin-Miller
test. Then we have, as shown in [121],

    p_{k,1} < k^2 \, 4^{2 - \sqrt{k}}    for k \ge 2        (39.10-5)


For large numbers, the bound is much smaller than 1/4: for example, p_{1000,1} < 2^{-39}.
Other bounds given in the cited paper are


    p_{100,10} < 2^{-44}        (39.10-6a)
    p_{300,5}  < 2^{-60}        (39.10-6b)
    p_{600,1}  < 2^{-75}        (39.10-6c)


The last bound is stronger than that of relation [39.10-5]. Still stronger bounds are given in [91], also the
relation p_{k,t} < 4^{-t} for all k ≥ 2 and t ≥ 1.
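
To get a feeling for the size of the bound in relation [39.10-5] one can evaluate its base-2
logarithm. The following small sketch does this; the function name log2_bound_pk1() is made up
for this illustration.

```cpp
#include <cmath>

// Base-2 logarithm of k^2 * 4^(2 - sqrt(k)), the right side of relation 39.10-5.
double log2_bound_pk1(double k)
{
    return 2.0 * std::log2(k) + 2.0 * (2.0 - std::sqrt(k));
}
```

For k = 1000 the value lies between -40 and -39, in agreement with the example given above. For
k = 600 the value is far above -75, so the bound for p_{600,1} quoted from the cited paper is
indeed the stronger one.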


Bases tested: 2 3 5 6 7 10 11 12 13 14 
Figure 39.10-E: Composites < 10^7 that are SPP to at least four bases.


The composites < 10^7 that are SPP to four or more bases a ≤ 17 are shown in figure [39.10-E]. We omit
values of a that are perfect powers because if n is a base-a SPP, then it is also a base-a^k SPP for all
k > 1. The entry for n = 4224533 (marked with an arrow) shows that a number that is not an SPP to
two bases a_1 and a_2 may still be an SPP to the base a_1 · a_2 (here a_1 = 2, a_2 = 3). This indicates that
one might want to restrict the tested bases to primes. All odd composite numbers < 10^7 that are SPP
to four or more prime bases a ≤ 17 are


Bases tested: 2 3 5 7 11 13 17 


1152271: [4] 3 7 11 13
1314631: [4] 5 7 11 13
2284453: [4] 2 3 7 11


Note that a number that is an SPP to bases a_1 and a_2 is not necessarily SPP to the base a_1 · a_2. An
example is n = 9,006,401 which is an SPP to bases 2 and 5 but not to base 10:


9006401: 2 4 5 8 16 18 
All composites < 10^7 that are SPP to bases 2 and 3 are also SPP to base 6, same for bases 2 and 5. Out
of six composites < 10^7 that are SPP to bases 2 and 7 three are not SPP to base 14:


[Listing not recoverable: the six composites that are SPP to bases 2 and 7, each with its SPP bases.]

Numbers which are SPP to several chosen bases are constructed in [18] where a composite 337-digit
number is given that is SPP to all prime bases a < 200. See also [365] and [366].
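
The statements about products of bases are easy to check by machine. The following sketch is
restricted to odd n < 2^32 so that plain 64-bit arithmetic suffices; the function name is_spp()
is made up for this illustration and is not the FXT routine.

```cpp
#include <cstdint>

// Whether the odd number n (3 <= n < 2^32) is a strong pseudoprime to base a.
bool is_spp(uint64_t n, uint64_t a)
{
    uint64_t q = n - 1;  int t = 0;
    while ( 0==(q & 1) )  { q >>= 1;  ++t; }   // n-1 == q * 2^t, q odd

    uint64_t b = 1, x = a % n;
    for (uint64_t e=q; e!=0; e>>=1)   // b = a^q mod n
    {
        if ( e & 1 )  b = (b * x) % n;   // products stay below 2^64
        x = (x * x) % n;
    }

    if ( (b==1) || (b==n-1) )  return true;
    for (int e=1; e<t; ++e)
    {
        b = (b * b) % n;
        if ( b==n-1 )  return true;
    }
    return false;
}
```

With this routine is_spp(9006401, 2) and is_spp(9006401, 5) hold while is_spp(9006401, 10) does
not, as stated in the text.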


39.10.3 Implementation of the Rabin-Miller test 
A C++ implementation of the test for pseudoprimality is given in [FXT: mod/rabinmiller.cc]:

    bool
    is_strong_pseudo_prime(const umod_t n, const umod_t a, const umod_t q, const int t)
    // Return whether n is a strong pseudoprime to base a.
    // q and t must be set so that  n == q * 2^t + 1
    {
        umod_t b = pow_mod(a, q, n);

        if ( 1==b )  return true;  // passed
        // if ( n-1==b )  return true;  // passed

        int e = 1;
        while ( (b!=1) && (b!=(n-1)) && (e<t) )
        {
            b = mul_mod(b, b, n);
            e++;
        }

        if ( b!=(n-1) )  return false;  // =--> composite

        return true;  // passed
    }
It uses the routine

    void
    n2qt(const umod_t n, umod_t &q, int &t)
    // Set q,t so that  n == q * 2^t + 1
    // n must not equal 1, else routine loops.
    {
        q = n - 1;  t = 0;
        while ( 0==(q & 1) )  { q >>= 1;  ++t; }
    }
Now the Rabin-Miller test can be implemented as

    bool
    rabin_miller(umod_t n, uint cm/*=0*/)
    // Rabin-Miller compositeness test.
    // Return true if none of the bases <=cm prove compositeness.
    // If false is returned, then n is proven composite (also for n=1 or n=0).
    // If true is returned the probability
    //   that n is composite is less than (1/4)^cm
    {
        if ( n<=1 )  return false;
        if ( n < small_prime_limit )  return is_small_prime( (ulong)n );

        umod_t q;
        int t;
        n2qt(n, q, t);

        if ( 0==cm )  cm = 20;  // default
        uint c = 0;
        while ( ++c<=cm )
        {
            umod_t a = c + 1;

            // if n is a c-SPP, then it also is a c**k (k>1) SPP.
            // That is, powers of a non-witness are non-witnesses.
            // So we skip perfect powers:
            if ( is_small_perfpow(a) )  continue;

            if ( a>=n )  return true;
            if ( !is_strong_pseudo_prime(n, a, q, t) )  return false;  // proven composite
        }

        return true;  // strong pseudoprime for all tested bases
    }


The function is_small_perfpow() [FXT: mod/perfpow.cc] returns true if its argument is a (small)
perfect power. It uses a lookup in a precomputed bit-array.


A generalization of the Rabin-Miller test applicable when more factors (apart from 2) of n - 1 are known
is given in [49]. The Frobenius test is described in [118, p.145], see also [237]. Another generalization
(named extended quadratic Frobenius primality test) is suggested in [122].


39.11 Proving primality 


We describe several methods to prove primality. Only the first, Pratt's certificate of primality, is applicable
for numbers of arbitrary form but not practical in general because it relies on the factorization of n - 1.
The Pocklington-Lehmer test only needs a partial factorization of n - 1. We give further tests applicable
for numbers of special forms: Pepin's test, the Lucas-Lehmer test, and the Lucas test.


As already said, the Rabin-Miller test can only prove compositeness. Even if a candidate ‘survives’ many 
passes, we only know that it is prime with a high probability. 


39.11.1  Pratt's certificate of primality 


Only with a prime modulus p the maximal order equals R = p - 1. To determine the order of an element
modulo p one needs the factorization of p - 1. If the factorization of p - 1 is known and we can find a
primitive root, then we do know that p is prime. Thus it is quite easy to prove primality for numbers
of certain special forms. For example, let p := 2 · 3^30 + 1 = 411,782,264,189,299. One finds that 3 is a
primitive root and so we know that p is prime.


[314159311, [3], [2, 3, 5, 199, 1949]]
    [2, "--"]
    [3, [2], [2]]
        [2, "--"]
    [5, [2], [2]]
        [2, "--"]
    [199, [3], [2, 3, 11]]
        [2, "--"]
        [3, [2], [2]]
            [2, "--"]
        [11, [2], [2, 5]]
            [2, "--"]
            [5, [2], [2]]
                [2, "--"]
    [1949, [2], [2, 487]]
        [2, "--"]
        [487, [3], [2, 3]]
            [2, "--"]
            [3, [2], [2]]
                [2, "--"]


Figure 39.11-A: A certificate for the primality of p = 314,159,311.


In general, the factorization of p — 1 can contain large factors whose primality needs to be proven. 
Recursion leads to a primality certificate in the form of a tree which is called Pratt’s certificate of 
primality. 


A certificate for the primality of p = 314,159,311 is shown in figure [39.11-A]. The first line says that 3
is a primitive root of p = 314159311 and p - 1 has the prime factors 2, 3, 5, 199, 1949 (actually,
p - 1 = 2 · 3^4 · 5 · 199 · 1949, but we can ignore exponents). The second level, indented by 4 characters,
gives the prime factors just determined together with their primality certificates: the prime 2 is trivially
accepted, all other primes are followed by their (further indented) certificates.

The certificate was produced with the following GP code:

    indprint(x, ind)=
    { /* print x, indented by ind characters */
        for (k=1, ind, print1(" ") );
        print(x);
    }

    pratt(p, ind=0)=
    {
        local( a, p1, f, nf, t );
        if ( p<=2,  \\ 2 is trivially prime
            indprint([p, "--"], ind);
            return();
        );
        \\ p-1 is factored here:
        a = lift( znprimroot(p) );
        \\ but we cannot access the factorization, so we do it "manually":
        p1 = p-1;
        f = factor(p1);
        nf = matsize(f)[1];
        t = vector(nf,j, f[j,1] );  f = t;  \\ prime factors only
        indprint([p, [a], t], ind);
        \\ recurse on prime factors of p-1:
        for (k=1, nf, pratt(f[k], ind+4) );
        return();
    }
    ? p=nextprime(Pi*10^8);
    ? pratt(p)


[314159265358979323846264338327950288419716939937531, [3], \
  [2, 3, 5, 67, 89, 151, 39829177707048956693, 292001794929603845621939]]
    [151, [6], [2, 3, 5]]
    [39829177707048956693, [2], [2, 9957294426762239173]]
        [9957294426762239173, [6], [2, 3, 7, 11, 14153, 385109, 1977139]]
            [14153, [3], [2, 29, 61]]
            [385109, [2], [2, 43, 2239]]
                [2239, [3], [2, 3, 373]]
                    [373, [2], [2, 3, 31]]
            [1977139, [3], [2, 3, 109841]]
                [109841, [3], [2, 5, 1373]]
                    [1373, [2], [2, 7]]
    [292001794929603845621939, [2], [2, 13, 3157127, 3557296955910619]]
        [3157127, [7], [2, 7, 225509]]
            [225509, [2], [2, 56377]]
                [56377, [5], [2, 3, 29]]
        [3557296955910619, [3], [2, 3, 47, 673, 6247908971]]
            [673, [5], [2, 3, 7]]
            [6247908971, [2], [2, 5, 624790897]]
                [624790897, [5], [2, 3, 13016477]]
                    [13016477, [2], [2, 11, 29, 101]]
                        [101, [2], [2, 5]]


Figure 39.11-B: A shortened certificate for the primality of the first prime greater than π · 10^50. Here all
primes less than 100 are considered trivially verifiable and not listed.


The routine has to be taken with a grain of salt as we rely on znprimroot(p) failing for composite p:


? pratt(1000)
  ***   primitive root does not exist in gener


The routine has an additional parameter ind determining the indentation used with printing. This 
parameter is incremented with the recursion level, resulting in the tree-like structure of the output. This 
little trick is often useful with recursive procedures. 


With a precomputed table of small primes (see section [39.3] on page 770) the line
    if ( p<=2,  \\ 2 is trivially prime

can be changed to something like
    if ( (p<=ptable_max) && (ptable[p]==1),  \\ trivial to verify
which will shorten the certificate significantly. A certificate for the smallest prime p greater than
π · 10^50 and ptable_max=100 is shown in figure [39.11-B], the output of 'trivial' primes is suppressed.
We note that p = ⌈π · 10^50⌉ + 20.


Once a certificate is computed it can be verified very quickly. As this type of primality certificate needs 
the factorization of p — 1 its computation is in general not feasible for large values of p. 
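
The verification of a single line [p, [a], [q_1, q_2, ...]] of such a certificate can be sketched as
follows for p < 2^64. This is an illustration only: the names pow_mod() and pratt_line_ok() are
made up here, and the type unsigned __int128 is a GCC/Clang extension that is assumed to be
available. The primality of the q_i is not checked by this routine, that is the job of the
(recursively verified) sub-certificates.

```cpp
#include <cstdint>
#include <vector>

typedef unsigned __int128 u128;  // compiler extension, assumed available

// a^e mod m by binary exponentiation.
static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    while ( e )
    {
        if ( e & 1 )  r = (uint64_t)( (u128)r * a % m );
        a = (uint64_t)( (u128)a * a % m );
        e >>= 1;
    }
    return r;
}

// Check one certificate line: the numbers in qs must be exactly the prime
// factors of p-1 and a must have order p-1 modulo p.
bool pratt_line_ok(uint64_t p, uint64_t a, const std::vector<uint64_t> &qs)
{
    if ( p < 3 )  return false;
    uint64_t r = p - 1;
    for (uint64_t q : qs)
    {
        if ( (q < 2) || (0 != r % q) )  return false;  // q must divide p-1
        while ( 0 == r % q )  r /= q;
    }
    if ( r != 1 )  return false;  // a prime factor of p-1 is missing
    if ( 1 != pow_mod(a, p-1, p) )  return false;
    for (uint64_t q : qs)
        if ( 1 == pow_mod(a, (p-1)/q, p) )  return false;  // order of a is too small
    return true;
}
```

A call like pratt_line_ok(314159311, 3, {2, 3, 5, 199, 1949}) would check the first line of the
certificate in figure [39.11-A]. The cost is one modular exponentiation per prime factor of p - 1.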


39.11.2 The Pocklington-Lehmer test 


Let p - 1 = F · U where F > U and the complete factorization of F is known. If, for each prime factor q
of F, we can find a_q such that a_q^{p-1} ≡ 1 mod p and gcd(a_q^{(p-1)/q} - 1, p) = 1, then p is prime.


The corresponding algorithm is called the Pocklington-Lehmer test for primality. The following
implementation removes entries from the list of prime factors q of F until the list is empty:

    pocklington_lehmer(F, u, c=10000)=
    { /* Pocklington-Lehmer test for the primality of p=f*u+1.
       * Return last successful base, else zero.
       * F must be the factorization of f.
       * Test bases a=2...c
       * Must have u<f.
       */
        local(n, f, C, p, t, ct);
        n = matsize(F)[1];
        f = prod(j=1, n, F[j,1]^F[j,2]);
        if ( f<=u, return(0) );
        p = f*u + 1;
        C = vector(n, j, (p-1)/F[j,1]);
        ct = n;  \\ number of remaining prime divisors of f
        for (a=2, c,
            if ( 1==Mod(a,p)^(p-1),
                for (j=1, n,
                    if ( C[j]!=0,  \\ skip entries already removed
                        t = lift( Mod(a,p)^C[j] );
                        if ( 1==gcd(t-1, p),
                            C[j] = 0;  \\ remove entry
                            ct -= 1;  \\ number of remaining entries
                        );
                    );
                );
                if ( ct==0, return(a) );
            );
        );
        return( 0 );
    }
We search all primes of the form p = F · U + 1 where F = 100!, U = F - d, and d lies in the range
1, ..., 1000. Only candidates that are strong pseudoprimes to both bases 2 and 3 are tested:

    f=100!;
    F=factor(f);
    { for (d=1, 1000,
        u=f-d;
        p = f*u+1;
        if ( sppq(p, 2) && sppq(p,3),
            q2 = pocklington_lehmer(F, u);
            print1(d, ": ");
            print1(" ", q2);
            print();
        );
    ); }

We find five such primes ≈ 8.70978248908948 · 10^315 (in about ten seconds):

      d:  last a
     45:  103
    778:  101
    818:  381
    884:  103


The returned value a is the one that led to the removal of the last entry in C[]. The value is
smaller with fewer prime factors of F. Setting F = 2^500 we find primes (≈ 1.07150860718626 · 10^301) of the
form p = F · U + 1 where U = F - d and 1 ≤ d ≤ 3000 for the following d and maximal a:


d last a d: last a 
214: 5 1383: 13 
1033: 3 1521: H 
1114: 5 2481: 3 
1321 17 


The search takes about 20 seconds. Discarding candidates that have small prime factors (p < 1,000)
gives a four-fold speedup. The prime 2^3340 (2^3340 - 1633) + 1 ≈ 7.59225935 · 10^2010 is found within five
minutes. A further refinement of the test is given in [110], see also [88].
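
For word-sized numbers the test can be sketched in C++ as follows. This is an illustration
under stated assumptions, not the author's code: p = f*u + 1 must fit into 64 bits, the caller
supplies the distinct prime factors of f, the helper names are made up, and unsigned __int128
(a GCC/Clang extension) is assumed to be available.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

typedef unsigned __int128 u128;  // compiler extension, assumed available

static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    while ( e )
    {
        if ( e & 1 )  r = (uint64_t)( (u128)r * a % m );
        a = (uint64_t)( (u128)a * a % m );
        e >>= 1;
    }
    return r;
}

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while ( b )  { uint64_t t = a % b;  a = b;  b = t; }
    return a;
}

// Pocklington-Lehmer test for p = f*u + 1 where qs holds the distinct prime
// factors of f and f > u.  Bases a = 2, ..., c are tried.
// Returns the last successful base (then p is proven prime), else zero.
uint64_t pocklington_lehmer(uint64_t f, uint64_t u, std::vector<uint64_t> qs,
                            uint64_t c = 10000)
{
    if ( (f <= u) || qs.empty() )  return 0;
    const uint64_t p = f*u + 1;  // assumed not to overflow
    for (uint64_t a=2; a<=c; ++a)
    {
        if ( 1 != pow_mod(a, p-1, p) )  continue;
        for (std::size_t j=0; j<qs.size(); )
        {
            // t >= 1 here because a^(p-1) == 1 implies that a is invertible
            const uint64_t t = pow_mod(a, (p-1)/qs[j], p);
            if ( 1 == gcd_u64(t-1, p) )  { qs[j] = qs.back();  qs.pop_back(); }  // remove entry
            else  ++j;
        }
        if ( qs.empty() )  return a;
    }
    return 0;
}
```

For example, with f = 32, u = 3 and the single prime factor 2 the prime p = 97 is proven with
base 5, while for f = 32, u = 1 (p = 33) no base succeeds.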


39.11.3 Tests for n = k 2^t + 1
39.11.3.1  Proth's theorem and Pepin's test


For numbers of the form p = q · 2^t + 1 with q odd and 2^t > q primality can be proven as follows: If there
is an integer a such that a^{(p-1)/2} ≡ -1 mod p, then p must be prime. This is Proth's theorem.
The 'FFT-primes' (see section [26.1] on page 535) are natural candidates for Proth's theorem. For example,
with p := 2^57 · 29 + 1 = 4,179,340,454,199,820,289 one finds that a^{(p-1)/2} ≡ -1 for a = 3, so p must
be prime. Note that Proth's theorem is the special case F = 2^t > k = U of the Pocklington-Lehmer test.


Numbers of the form 2^t + 1 are composite unless t is a power of 2. The candidates are therefore restricted
to the Fermat numbers F_n := 2^{2^n} + 1. Here it suffices to test whether 3^x ≡ -1 mod F_n where x = 2^{2^n - 1}:


    pepin(tx)=
    {
        local(t, F, x);
        t = 2^tx;
        F = 2^t+1;
        x = 2^(t-1);
        return( (-1==Mod(3,F)^x) );
    }

This test is known as Pepin's test. As shown in section [39.8] on page 781 all non-residues are primitive
roots modulo prime F_n. Three is just the smallest non-residue.


? for (tx=1,12, print(tx,"  ",pepin(tx)))
1  1  \\ F_1 = 5
2  1  \\ F_2 = 17
3  1  \\ F_3 = 257
4  1  \\ F_4 = 65537
5  0
[--snip--]
12  0


No Fermat prime greater than F_4 = 65537 is known today and all F_n where 5 ≤ n ≤ 32 are known to be
composite.


Note that F_{n+1} has (about) twice as many bits as F_n. Also the number of squarings (t - 1) involved in
the test is (about) doubled. If we underestimate the cost of multiplying N-bit numbers as N operations,
we get a lower bound 4 for the ratio of the costs of testing F_{n+1} and F_n. Assuming the computer
power doubles every 18 months and Pepin's test of F_n is just feasible today we'd have to wait three years
(36 months) before we can test F_{n+1}. The computation that proved F_24 composite is described in [117].
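
Both Proth's theorem and Pepin's test are easily tried for numbers that fit into a machine
word. The following sketch makes these assumptions: p < 2^64, the names proth_proves_prime()
and pepin() (a C++ cousin of the GP routine above) are chosen for this illustration only, and
unsigned __int128 (a GCC/Clang extension) is available.

```cpp
#include <cstdint>

typedef unsigned __int128 u128;  // compiler extension, assumed available

static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    while ( e )
    {
        if ( e & 1 )  r = (uint64_t)( (u128)r * a % m );
        a = (uint64_t)( (u128)a * a % m );
        e >>= 1;
    }
    return r;
}

// Proth's theorem for p = q*2^t + 1 with q odd and 2^t > q:
// return true if a^((p-1)/2) == -1 mod p, which proves that p is prime.
// A return value of false proves nothing, another base may still succeed.
bool proth_proves_prime(uint64_t q, int t, uint64_t a)
{
    if ( (0==(q & 1)) || (t < 1) || (t > 62) )  return false;
    if ( ((uint64_t)1 << t) <= q )  return false;       // need 2^t > q
    if ( q > ((~(uint64_t)0) >> t) )  return false;     // p would not fit into 64 bits
    const uint64_t p = (q << t) + 1;
    return ( p-1 == pow_mod(a, (p-1)/2, p) );
}

// Pepin's test for F_n = 2^(2^n) + 1, only for 1 <= n <= 5 in this sketch.
bool pepin(int n)
{
    if ( (n < 1) || (n > 5) )  return false;  // outside the range handled here
    const uint64_t F = ((uint64_t)1 << (1u << n)) + 1;
    return ( F-1 == pow_mod(3, (F-1)/2, F) );
}
```

The call proth_proves_prime(29, 57, 3) corresponds to the example p = 2^57 · 29 + 1 given above.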


39.11.3.2 What to consider before doing Pepin’s test 


As 2^t ≡ -1 mod F_n = 2^t + 1 (where t = 2^n) we see that the order of 2 equals 2t = 2^{n+1}. The same is
true for factors of composite Fermat numbers. When searching factors of F_n we only need to consider
candidates of the form 1 + k 2^{n+2}. A routine that searches for small factors of F_n can be implemented as:


    ord2pow2(p)=
    { \\ Return the base-2 logarithm of the order of 2 modulo p
      \\ Must have: ord(2)==2^k for some k
        local(m, rx);
        rx = 0;
        m = Mod(2,p);
        while ( m!=1, m*=m; rx++; );
        return( rx );
    }

    ftrialx(n, mm=10^5, brn=0)=
    { \\ Try to find small factors of the Fermat number F_n=2^(2^n)+1
      \\ Try factors 1+ps, 1+2*ps, ..., 1+mm*ps  where ps=2^(n+2)
      \\ Stop if brn factors were found (zero: do not stop)
        local(p,ps,ttx,fct);
        ps = 2^(n+2);  \\ factors are of the form 1+k*ps
        p = ps+1;  \\ trial factor
        ttx = 2^(n+1);  \\ will test whether Mod(2,p)^ttx==1
        fct = 0;  \\ how many factors were found so far
        for (ct=1, mm,
            if ( (Mod(2,p)^(ttx)==1)  \\ order condition
                 && ( (rx=ord2pow2(p)) == n+1 )  \\ avoid factors of smaller Fermat numbers
              , /* then */
                print1(n, ": ");
                print1(p);
                print1("  p-1=",factor(p-1));
                print();
                fct++;
                if ( fct==brn, break() );
            );
            p += ps;
        );
        return(fct);
    }


We create a list of small prime factors of F_n for 5 ≤ n ≤ 32 where the search is restricted to factors
f ≤ 1 + 10^5 · 2^{n+2} and stopped when a factor was found:
    for(n=5,32, ftrialx(n, 10^5, 1); );

5: 641 p-1=[2, 7; 5, 1] 

6: 274177 p-1=[2, 8; 3, 2; 7, 1; 17, 1] 

9: 2424833 p-1=[2, 16; 37, 1] 

10: 45592577 p-1=[2, 12; 11131, 1] 

11: 319489 p-1=[2, 13; 3, 1; 13, 1] 

12: 114689 p-1=[2, 14; 7, 1] 

15: 1214251009 p-1=[2, 21; 3, 1; 193, 1] 

16: 825753601 p-1=[2, 19; 3, 2; 5, 2; 7, 1] 

18: 13631489 p-1=[2, 20; 13, 1] 

19: 70525124609 p-1=[2, 21; 33629, 1] 

23: 167772161 p-1=[2, 25; 5, 1] 

32: 25409026523137 p-1=[2, 34; 3, 1; 17, 1; 29, 1] 


A list for 5 ≤ n ≤ 300 is given in [FXT: data/small-fermat-factors.txt]. Note that an entry


201: 124569837190956926160012901398286924947521176078042100592562667521 \ 
p-1=[2, 204; 3, 1; 5, 1; 17, 1; 19, 1] 


asserts the compositeness of the number F_201 where Pepin's test is out of reach by far. Indeed, its binary
representation could not be stored in all existing computer memory combined: F_201 is a
log_2(F_201) = 2^201 ≈ 3.2138 · 10^60-bit number.


The currently known (partial) factorizations of Fermat numbers are given in [202]. 
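
The search for small factors can also be sketched in C++. Instead of computing the order of 2
this version tests the condition 2^(2^n) ≡ -1 mod p directly (which forces the order to be
2^(n+1)). The function name fermat_factor() is made up for this illustration, the candidates are
assumed to stay below 2^63, and unsigned __int128 (a GCC/Clang extension) is assumed to be available.

```cpp
#include <cstdint>

typedef unsigned __int128 u128;  // compiler extension, assumed available

// Search a factor of F_n = 2^(2^n)+1 among the candidates
// p = 1 + k*2^(n+2) for k = 1, ..., kmax.
// Returns the first factor found, zero if there is none.
uint64_t fermat_factor(int n, uint64_t kmax = 100000)
{
    if ( (n < 0) || (n > 40) )  return 0;  // keep the candidates small
    const uint64_t ps = (uint64_t)1 << (n+2);
    uint64_t p = 1;
    for (uint64_t k=1; k<=kmax; ++k)
    {
        p += ps;
        uint64_t m = 2;
        for (int j=0; j<n; ++j)  m = (uint64_t)( (u128)m * m % p );  // m = 2^(2^n) mod p
        if ( m == p-1 )  return p;  // 2^(2^n) == -1 mod p, so p divides F_n
    }
    return 0;
}
```

As all divisors of F_n are of the tested form, the first candidate found is the smallest
divisor greater than one and therefore prime. For n = 5, 6, 9 and 12 the routine should
reproduce the entries 641, 274177, 2424833 and 114689 of the list above.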


39.11.4 Tests for n = k 2^t - 1


39.11.4.1 The Lucas-Lehmer test for Mersenne numbers
Define the sequence H by H_0 = 1, H_1 := 2 and H_i = 4 H_{i-1} - H_{i-2}. The Mersenne number n = 2^e - 1
is prime if and only if H_{2^{e-2}} ≡ 0 mod n. The first few terms of the sequence H are


  k:   0 1 2  3  4   5    6    7     8     9     10     11      12
H_k:   1 2 7 26 97 362 1351 5042 18817 70226 262087 978122 3650401


The numbers H_k can be computed efficiently via the index doubling formula H_{2k} = 2 H_k^2 - 1. Starting
with the value H_1 = 2 and computing modulo n the implementation is as simple as


    LL(e)=
    {
        local(n, h);
        n = 2^e-1;
        h = Mod(2,n);
        for (k=1, e-2, h=2*h*h-1);
        return( 0==h );
    }
    ? LL(521)
    1  \\ 2^521-1 is prime
    ? LL(239)
    0  \\ 2^239-1 is composite
    ? LL(9941)
    1  \\ 2^9941-1 is prime
    ? ##
      ***   last result computed in 4,296 ms.


The algorithm is called the Lucas-Lehmer test. Note that most sources use the sequence V =
2, 4, 14, 52, 194, 724, 2702, ... that satisfies the same recurrence relation. We have H_k = (1/2) V_k (H is half
of V). The index doubling relation becomes V_{2k} = V_k^2 - 2. The sequence of values H_{2^k} starts as


2, 7, 97, 18817, 708158977, 1002978273411373057, 2011930833870518011412817828051050497, ...
This is entry A002812 in [312], entry A003010 gives the values V_{2^k}:
4, 14, 194, 37634, 1416317954, 2005956546822746114, 4023861667741036022825635656102100994, ...


The sequence of (currently known) exponents e such that n = 2^e - 1 is prime is entry A000043 in [312]:
2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, 521, 607, 1279, 2203, 2281, 3217, 

4253, 4423, 9689, 9941, 11213, 19937, 21701, 23209, 44497, 86243, 110503, 

132049, 216091, 756839, 859433, 1257787, 1398269, 2976221, 3021377, 6972593, 13466917 
A few more exponents for Mersenne primes are known: 


20996011, 24036583, 25964951, 30402457, 32582657, 37156667, 42643801, 43112609 
They are not included in the sequence as they might be preceded by currently unknown values. The list
of exponents is also given in [FXT: mod/mersenne-exponents.cc].
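
For exponents up to 63 the Lucas-Lehmer test fits into machine words. The following sketch
mirrors the GP routine LL() above; the name lucas_lehmer() is made up for this illustration and
unsigned __int128 (a GCC/Clang extension) is assumed to be available. Note that the test is
stated for odd prime exponents, so the sketch refuses e < 3.

```cpp
#include <cstdint>

typedef unsigned __int128 u128;  // compiler extension, assumed available

// Lucas-Lehmer test for n = 2^e - 1 where e is an odd prime, 3 <= e <= 63.
bool lucas_lehmer(int e)
{
    if ( (e < 3) || (e > 63) )  return false;  // outside the scope of this sketch
    const uint64_t n = (~(uint64_t)0) >> (64 - e);  // n = 2^e - 1
    uint64_t h = 2;  // H_1
    for (int k=1; k<=e-2; ++k)
    {
        const u128 s = (u128)h * h;           // H^2, less than 2^126
        h = (uint64_t)( (2*s + n - 1) % n );  // index doubling: 2*H^2 - 1 modulo n
    }
    return ( 0 == h );
}
```

The routine returns true exactly for those odd primes e ≤ 63 that appear in the list of
exponents above, namely 3, 5, 7, 13, 17, 19, 31, and 61.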
39.11.4.2 What to consider before doing the Lucas-Lehmer test 
The exponent e of a Mersenne prime must be prime, else n factors algebraically as


    2^e - 1 = \prod_{d \mid e} Y_d(2)        (39.11-1)


where Y_k(x) is the k-th cyclotomic polynomial (see section [37.1.1] on page 704). For example, with
2^21 - 1 = 2097151 the following factors are found:


? m=1; fordiv(21,d, y2=subst(polcyclo(d,x),x,2); m*=y2; print(d,": ",y2)); m
1: 1
3: 7
7: 127
21: 2359
2097151  \\ == 2^21-1


These factors are not necessarily prime: here 2359 factors further into 7 · 337. More information on the
multiplicative structure of b^e ± 1 can be found in [89]. We note the relation


    \gcd(2^n - 1, 2^m - 1) = 2^g - 1 = \prod_{d \mid g} Y_d(2)    where g = \gcd(n, m)        (39.11-2)


It is the special case x = 2, y = 1 of


    \gcd(x^n - y^n, x^m - y^m) = x^g - y^g                                                    (39.11-3a)
        = y^g \prod_{d \mid g} Y_d(x/y) = \prod_{d \mid g} y^{\varphi(d)} Y_d(x/y)            (39.11-3b)


The relation follows from relation [37.1-14] and the fact that the cyclotomic polynomials Y_n and Y_m
are coprime for n ≠ m.


Before doing the Lucas-Lehmer test one should do a special version of trial division based on the following
observation: any factor f of m = 2^e - 1 has the property that 2^e ≡ 1 mod f. That is, 2^e - 1 ≡ 0 mod f,
so m ≡ 0 mod f and f divides m. We further exploit that possible factors f are of the form 2ek + 1 and
that f ≡ ±1 mod 8. The following routine does not try to assert the primality of a candidate factor as
this would render the computation considerably slower.


    mers_trial(e, mct=10^7, bnf=0)=
    { \\ try to discover small factors of the Mersenne number 2^e-1
      \\ e : exponent of the Mersenne number
      \\ mct : how many candidate factors are tried
      \\ bnf : stop after bnf factors were found (zero: do not stop)
        local(f, fi, ct, fct, m8);
        print("exponent e=",e);
        print("trying up to ", mct, " factors");
        fi=2*e;  \\ factors are of the form 2*e*k+1
        f=1;
        ct=0;
        fct=0;  \\ how many factors were found so far
        while ( ct < mct,
            f += fi;
            ct++;  \\ count the candidates tried
            m8 = bitand(f, 7);  \\ factor modulo 8
            if ( (1!=m8) && (7!=m8), next(); );  \\ must equal +1 or -1
            if ( Mod(2, f)^e == Mod(1, f),
                print(f, "  ", isprime(f));  \\ give factor and tell whether it is prime
                fct++;
                if ( fct==bnf, break(); );
            );
        );
    }


For m = 2^10007 - 1 (3013 decimal digits) we find three factors of which all are prime:


? e=10007; mers_trial(e,,3);
exponent e=10007
trying up to 10000000 factors
240169  1
60282169  1
440255213  1
  ***   last result computed in 44 ms.
? ceil((e*log(2.0)/log(10.0)))
3013  \\ m=2^e-1 has 3,013 decimal digits


Sometimes one is lucky with truly huge numbers:


? e=2^31-1; mers_trial(e,,1);
exponent e=2147483647
trying up to 10000000 factors
385257526626031  1
  ***   last result computed in 583 ms.
? ceil((e*log(2.0)/log(10.0)))
646456993  \\ m=2^e-1 has 646,456,993 decimal digits


Note that we found that m = 2^e - 1 is prime if and only if there is no prime f < m where the order of
2 equals e. A special case is sometimes given as follows: if both p = 4k + 3 and q = 2p + 1 are prime,
then q divides 2^p - 1 (because the order of 2 modulo q equals p).


By the way, if both p = 4k + 1 and q = 2p + 1 are prime, then q divides 2^p + 1 (because the order of 2
modulo q equals 2p and 2^{2p} - 1 = (2^p + 1) (2^p - 1)).
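
The special trial division can be sketched in C++ as follows. As in the GP routine, candidates
f = 2ek + 1 are only tested if f ≡ ±1 mod 8, and their primality is not checked. The name
mersenne_factor() and the default search limit are made up for this illustration, f is assumed
to stay below 2^64, and unsigned __int128 (a GCC/Clang extension) is assumed to be available.

```cpp
#include <cstdint>

typedef unsigned __int128 u128;  // compiler extension, assumed available

static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    while ( e )
    {
        if ( e & 1 )  r = (uint64_t)( (u128)r * a % m );
        a = (uint64_t)( (u128)a * a % m );
        e >>= 1;
    }
    return r;
}

// Search a factor f = 2*e*k + 1 (k = 1, ..., kmax) of the Mersenne number 2^e - 1.
// Returns the first factor found, zero if there is none.
uint64_t mersenne_factor(uint64_t e, uint64_t kmax = 1000000)
{
    uint64_t f = 1;
    for (uint64_t k=1; k<=kmax; ++k)
    {
        f += 2*e;
        const uint64_t m8 = f & 7;                      // factor modulo 8
        if ( (m8 != 1) && (m8 != 7) )  continue;        // must equal +1 or -1
        if ( 1 == pow_mod(2, e, f) )  return f;         // 2^e == 1 mod f
    }
    return 0;
}
```

The exponents p = 11, 23, and 83 are of the form 4k + 3 with q = 2p + 1 prime, so the first
candidate (q = 23, 47, and 167, respectively) is already a factor. For a prime Mersenne number
the search ends with the number itself if kmax is large enough.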


39.11.4.3 Lucas-Lehmer test with floats ‡
The Binet form (see section [35.1.6] on page 674) of the sequence H_n is


    H_n = \frac{1}{2} \left[ (2 + \sqrt{3})^n + (2 - \sqrt{3})^n \right]        (39.11-4)


We can rewrite the expression in the form


    H_n = \frac{1}{2} \left[ \exp(x)^n + \exp(x)^{-n} \right] = \frac{1}{2} \left[ \exp(n x) + \exp(-n x) \right]        (39.11-5)

where x = log(2 + √3). The hyperbolic cosine can be defined as


    \cosh(x) = \frac{1}{2} \left[ \exp(x) + \exp(-x) \right]        (39.11-6)

and the expression equals H_n for x = n log(2 + √3). Now we can give a criterion equivalent to the
Lucas-Lehmer condition as follows:


    \cosh\left( 2^{m-2} \log(2 + \sqrt{3}) \right) \equiv 0 \bmod M_m  \iff  M_m is prime        (39.11-7)


The relation is computationally useless because the quantity to be computed grows doubly-exponentially
with m: the number of digits grows exponentially with m. Already for m = 17 the calculation has to be
carried out with more than 18,741 decimal digits:


? cosh(2^(17-2)*log(2+sqrt(3)))
1.8888939581139837726097538478056602 E18741


The program [hfloat: examples/ex8.cc] does the computations in the obvious (insane) way. Using a
precision of 32,768 decimal digits we obtain:


cosh(...)=
+.18888939581139837726097538478056602859465844315551 \
  [... about 18,000 digits ...]
  5579750039800680284170000000000000 ...
                       ^[decimal point after 7]
  00000000000000000000000000000000000000000000000000 \
  [...]
  00000000000000001549695720446140150427588985400185472*10^18742
  [nonzero due to numerical imprecision]


After rounding and computing the modulus, the program declares M_17 = 2^17 - 1 prime. All this using
just 4 MB of memory and computations equivalent to about 35 FFTs of length 1 million, taking about
four seconds. This is many million times the work needed by the original (sane) version of the test.
Even trial division would have been significantly faster.


The number M_31 would need a bigger machine as the computation needs a precision of more than
300 million digits:


? (2^(31-2)*log(2+sqrt(3)))/log(10)  /* approx decimal digits */
307062001.46039800926268312190009204


Apart from being insane the computation can be used to test high precision floating-point libraries. 


39.11.4.4 The Lucas test 


The Lucas-Lehmer test can be generalized for a less restricted set of candidates. The Lucas test can be
stated as follows (taken from [284, p.131]):


Let n = k 2^t - 1 where k is odd, 2^t > k, n ≢ 0 mod 3 and k ≢ 0 mod 3 (so we must have n ≡ 1 mod 3).
Then n is prime if and only if H_{(n+1)/4} ≡ 0 mod n where H is as given above.


To turn this into an efficient algorithm use the relation (n + 1)/4 = k 2^{t-2}. Compute H_k as described in
section [35.1.1] on page 666:


    [H_k, H_{k+1}] = [H_0, H_1] \begin{bmatrix} 0 & -1 \\ 1 & 4 \end{bmatrix}^k        (39.11-8)


This is a one-liner in GP: 
? H(k)= return( ([1,2] * [0, -1; 1, 4]^k)[1] );
? for(k=0,10,print(k,": ",H(k)," = 1/2 * ",2*H(k)))
0: 1 = 1/2 * 2
1: 2 = 1/2 * 4
2: 7 = 1/2 * 14
3: 26 = 1/2 * 52
4: 97 = 1/2 * 194  /* = 2*7^2-1 = (14^2-2)/2 */
5: 362 = 1/2 * 724
6: 1351 = 1/2 * 2702  /* = 2*26^2-1 = (52^2-2)/2 */
7: 5042 = 1/2 * 10084
8: 18817 = 1/2 * 37634
9: 70226 = 1/2 * 140452
10: 262087 = 1/2 * 524174


To compute H_{k 2^{t-2}} from H_k use (t - 2 times) the index doubling relation H_{2i} = 2 H_i^2 - 1. The test can
be implemented as

    H(k, n)= return( (Mod([1,2],n) * Mod([0,-1; 1,4], n)^k)[1] );
    lucas(k, t)=
    {
        local(n, h);
        /* check preconditions: */
        if ( 0==bitand(k,1), return(0) );  \\ k must be odd
        if ( k>=2^t, return(0) );
        n = k*2^t-1;
        if ( n%3!=1, return(0) );  \\ neither k nor n may be divisible by 3

        /* main loop: */
        h = H(k, n);
        for (j=1,t-2, h*=h; h+=h; h-=1; );  \\ index doubling
        return ( 0==h );
    }


Note that the routine returns 'false' even for primes if the preconditions are not met. With
n = 5 · 2^12 - 1 = 20479 we obtain


n=20479 k=5 t=12
H_(k*2^j) modulo n
[listing of the values not recoverable]


which shows that 20479 is prime. Proving n = 5 · 2^1940 - 1 prime takes about ten milliseconds. The
following code finds the first value t > 2500 so that n = 5 · 2^t - 1 is prime:
k=5; t=2500; while ( 0==lucas(k,t), t+=1; ); t


Within one second we get the result t = 2548.


39.11.4.5 Numbers of the form n = 24j + 7 and n = 24j + 19 ‡


          n:  SPP bases a<100,000 (max 5 given)
    1037623:  67191 67192
    2211631:  6333 7260 8160 16793 21219 21580
    4196191:  9104 26498 93477
    7076623:
    9100783:
   11418991:  44936
   15219559:
   21148399:
  [--snip--]
  829577839:
 [unreadable]:  4899 33982 46674 62180
  961315183:
 1192222639:
  [--snip--]
 2946282799:
 3075304399:
 3145717759:
 3299597407:
 3554502799:
 3554889199:
 4091977039:
 4207009999:


Figure 39.11-C: Composite numbers n < 2^32 of the form n = 24j + 7 that pass the Lucas-type test.
Five of them are strong pseudoprimes to some base a < 100,000.


Numbers of the form n = 24j + 7 satisfy the preconditions of the Lucas test except for the condition
that 2^t > k where n = k 2^t - 1. We test whether H_{(n+1)/4} ≡ 0 mod n, as in the Lucas test. Note that
H_n = T_n(2) where T_n(x) is the n-th Chebyshev polynomial of the first kind. We use the fast algorithm
for its computation described in section [35.2.3] on page 680 for the test routine:

    bool test_7mod24(ulong n)
    {
        ulong nu = (n+1) >> 2;
        umod_t t = chebyT2(nu, n);  // == chebyT(nu, 2, n);
        return (0==t);
    }

The function chebyT2() is given in [FXT: mod/chebyshev1.cc]. Figure [39.11-C] gives composite numbers
n < 2^32 that pass the test. The complete list of such numbers is given in [FXT: data/pseudo-7mod24.txt],
there are just 64 entries. Only five entries are strong pseudoprimes to any base a < 100,000, all shown
in figure [39.11-C].


The data suggests that composites of the form n = 24 j +7 that pass the test and are pseudoprime to a 
small base are extremely rare. The implied test would cover 1/8 of all candidates (that are not divisible 
by 2 or 3), as eight numbers (1, 5, 7, 11, 13, 17, 19, and 23) are coprime to 24. 
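
A self-contained variant of the test for n ≡ 7 mod 24 can be sketched as follows. The routine
chebyT2_mod() below is not the FXT function chebyT2(), it is a stand-in written for this
illustration: it computes T_m(2) mod n by a left-to-right binary method that uses the relations
T_{2k} = 2 T_k^2 - 1 and T_{2k+1} = 2 T_k T_{k+1} - 2 (the latter is the product formula for
Chebyshev polynomials evaluated at x = 2). The modulus is restricted to 3 ≤ n < 2^32 so that
64-bit arithmetic suffices.

```cpp
#include <cstdint>

// T_m(2) modulo n where T_m is the m-th Chebyshev polynomial of the first kind.
// Must have 3 <= n < 2^32.
uint64_t chebyT2_mod(uint64_t m, uint64_t n)
{
    uint64_t a = 1, b = 2;  // the pair (T_k, T_{k+1}) for k = 0
    for (int i=63; i>=0; --i)
    {
        const uint64_t t2k  = (2 * (a*a % n) + n - 1) % n;  // T_{2k}
        const uint64_t t2k1 = (2 * (a*b % n) + n - 2) % n;  // T_{2k+1}
        const uint64_t t2k2 = (2 * (b*b % n) + n - 1) % n;  // T_{2k+2}
        if ( (m >> i) & 1 )  { a = t2k1;  b = t2k2; }  // k --> 2k+1
        else                 { a = t2k;   b = t2k1; }  // k --> 2k
    }
    return a;
}

// Lucas-type test for numbers of the form n = 24j + 7, n < 2^32.
bool test_7mod24_sketch(uint32_t n)
{
    if ( (n % 24) != 7 )  return false;  // not of the required form
    return ( 0 == chebyT2_mod( ((uint64_t)n + 1) >> 2, n ) );
}
```

The primes 7, 31, 79, 103, and 127 pass the test while the composite 55 = 24 · 2 + 7 does not.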


          n:  SPP bases a<100,000
      30739:
     153931:
     249331:
    1575859:
    1960243:
    2557627:  36814 49266 49267 86080
    3444403:
    3767347:  26452 79860 94736
    3881179:  47489 67676 72825 73841 84995 87856
    3882283:
   14324491:
   14970499:
   15894163:
  [--snip--]
   97917619:
  100079611:  4820
  124134067:
  [--snip--]


Figure 39.11-D: Composite numbers n < 2^32 of the form n = 24k + 19 that pass the Lucas-type test.
Four of them are strong pseudoprimes to some base a < 100,000.


For numbers of the form n = 24j + 19 we use a different test: here we check whether U_{(n+1)/4-1} ≡ 0 mod n
where U_0 = 0, U_1 = 1, and U_k = 4 U_{k-1} - U_{k-2} (the Chebyshev polynomial of the second kind, U_n(x),
evaluated at x = 2). The function for testing is

    bool test_19mod24(ulong n)
    {
        ulong nu = ((n+1) >> 2) - 1;
        umod_t t = chebyU2(nu, n);  // == chebyU(nu, 2, n);
        return (0==t);
    }

where the function chebyU2() is given in [FXT: mod/chebyshev2.cc]. The list [FXT:
19mod24.txt] contains all (155) composites n < 2^32 that pass the test. An extract is shown in figure
[39.11-D]. Just four numbers n < 2^32 are also strong pseudoprimes to any base a < 100,000.
The application of second order recurrent sequences to primality testing is described in [35]: define the
sequence W_k by

    W_k = P W_{k-1} - Q W_{k-2},    W_0 = 0,    W_1 = 1        (39.11-9)


Then n is a Lucas pseudoprime (with parameters P and Q) if W_{n±1} ≡ 0 mod n, where the sign depends
on whether D = P^2 - 4Q is a square modulo n. For both cases considered here we have n = 12j + 7,
D = 16 - 4 = 12 = 4 · 3, and 3 is not a square modulo n. The test would be (note that W_n = U_{n-1}(2))

    bool lucas_7mod12(ulong n)
    {
        ulong nu = n;
        umod_t t = chebyU2(nu, n);
        return (0==t);
    }

This test is passed by far more composites than the two tests considered before. A primality test
combining a Lucas-type test and a test for strong pseudoprimality has been suggested in [272]. No
composite that passes the test has been found so far.
39.11.4.6 An observation regarding Mersenne numbers


An interesting observation is that the following seems to be true:


    M = 2^e - 1 prime  \iff  3^{2^{e-1}} \equiv -3 \bmod M        (39.11-10)

Note that for odd e the condition is equivalent to 3^{(n-1)/2} ≡ -1 mod n (where n = M) and 3 is a
non-residue. For prime exponents e we can see that we are very unlikely to find a composite M where
3^{(M-1)/2} ≡ -1 mod n: the number m = M is a strong pseudoprime (SPP) to base 2 by construction and the
right side of condition [39.11-10] says that m is an SPP to base 3. Given the rarity of composites that are
SPP to both bases the chances of finding such a number among the exponentially
growing Mersenne numbers are very small. Tony Reix, who observed the statement of relation [39.11-10]
independently, verified it for prime exponents up to 132,499.
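
The condition 3^{2^{e-1}} ≡ -3 mod 2^e - 1 needs just e - 1 modular squarings. A minimal sketch
for e ≤ 63 follows; the name mersenne_base3_check() is made up for this illustration and unsigned
__int128 (a GCC/Clang extension) is assumed to be available.

```cpp
#include <cstdint>

typedef unsigned __int128 u128;  // compiler extension, assumed available

// Whether 3^(2^(e-1)) == -3 modulo M = 2^e - 1, for 2 <= e <= 63.
bool mersenne_base3_check(int e)
{
    if ( (e < 2) || (e > 63) )  return false;  // outside the scope of this sketch
    const uint64_t M = (~(uint64_t)0) >> (64 - e);  // M = 2^e - 1
    uint64_t x = 3 % M;
    for (int j=0; j<e-1; ++j)  x = (uint64_t)( (u128)x * x % M );  // e-1 squarings
    return ( 0 == (x + 3) % M );
}
```

The check succeeds for the exponents 2, 3, 5, 7, 13, 17, 19, 31, and 61 of Mersenne primes and
fails, for example, for the prime exponents 11 and 23 where 2^e - 1 is composite.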


39.11.4.7 Primes that are evaluations of cyclotomic polynomials 


The Mersenne numbers and Fermat numbers are special cases of evaluations of cyclotomic polynomials 
Y, (see section |37.1.1|on page|704). The first numbers Y,,(2) are shown in figure [39.11-E| the sequence 
is entry A019320 in [312]. The sequence of values n such that Y,,(2) is prime is entry A072226 in [312]: 

2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 22, 24, 26, 27, 30, 

31, 32,'33, 34, 38,'40, 42, 46, 49, 56, 61, 62, 65, 69, 77, 78, 80, 85, 86, 

89, 90, 93, 98, 107, 120, 122, 126, 127, 129, 133, 145, 150, 158, 165, 170, 

174, 184, 192,'195,'202,'208, 234, 254, 261, ... 


The powers of two correspond to the Fermat primes. The prime numbers correspond to Mersenne primes. 
The sequence of numbers n such that Y,,(3) is prime is entry A138933 

1, 3, 6, 7, 9, 10, 12, 13, 14, 15, 21, 24, 26, 33, 36, 40, 46, 60, 63, 70, T1, 

72, 86, 103, 108,”130, 132, 143, 145,' 154, 161, 236, 255,°261; 276, 279, 287, ... 
Now set N :— Y,,(2), testing whether N is a base-3 SPP seems to determine primality for all values of 
n ¢ {2,6}. Note that for n a power of 2 the test is Pepin's test. Information about the primality of Y, (2) 
is given in [148]. Theorems about factorizations of Y; (z) where x is an integer are given in [159], see also 
and [76]. The factorization into Gaussian primes is discussed in [128]. 


The primes $Y_n(2)$ are also of interest for number theoretic transforms (see section 26.1 on page 535)
because of their special structure allowing for very efficient modular reduction (see section 39.2 on page 768).
A prominent example is $Y_{192}(2) = 2^{64} - 2^{32} + 1$. Note that the order of 2 modulo $Y_n(2)$ equals n.
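The statement about the order of 2 can be checked numerically for the prime $2^{64} - 2^{32} + 1$. The following is a minimal sketch, not code from the FXT library: the names `mulmod`, `powmod` and `has_order_192` are made up for this illustration, and the 128-bit intermediate type is a GCC/Clang extension. It uses a generic reduction, not the fast special-form reduction mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// p = Y_192(2) = 2^64 - 2^32 + 1
const uint64_t P192 = 0xFFFFFFFF00000001ULL;

// Product a*b modulo m via a 128-bit intermediate (compiler extension).
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)( (unsigned __int128)a * b % m );
}

// Right-to-left binary powering: returns a^e modulo m.
uint64_t powmod(uint64_t a, uint64_t e, uint64_t m)
{
    assert( m > 1 );
    uint64_t r = 1;
    a %= m;
    while ( e )
    {
        if ( e & 1 )  r = mulmod(r, a, m);
        a = mulmod(a, a, m);
        e >>= 1;
    }
    return r;
}

// True if the order of a modulo m is exactly 192 = 2^6 * 3:
// a^192 must be 1, while a^(192/2) and a^(192/3) must differ from 1.
bool has_order_192(uint64_t a, uint64_t m)
{
    return  powmod(a, 192, m) == 1
        &&  powmod(a, 192/2, m) != 1
        &&  powmod(a, 192/3, m) != 1;
}
```

Calling `has_order_192(2, P192)` tests the claim for n = 192. Because $2^{64} \equiv 2^{32} - 1$ modulo this prime, one also has $2^{96} \equiv -1$.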


The structure of the primes becomes visible (in base 10) if we check evaluations at 10. The first primes of
the form $Y_n(10)$ are

  n:  Yn(10)
  2:  11
  4:  101
 10:  9091
 12:  9901
 14:  909091
 19:  1111111111111111111
 23:  11111111111111111111111
 24:  99990001
 36:  999999000001
 38:  909090909090909091
 39:  900900900900990990990991
 48:  9999999900000001


Finally, we do a silly thing: the factors of $Y_{2^7-1}(x)$ over GF(2) are the irreducible binary polynomials of
degree 7. If we evaluate them as polynomials over Z at x = 10 and select the prime numbers we find the
following 8-digit primes consisting of only zeros and ones:

  10011101 10111001 11100101 11110111 11111101

The same procedure, with $Y_{3^5-1}(x)$ and factoring over GF(3), gives the primes

  101221 102101 111121 111211 112111 120011 122021

The list is created via

  n=3^5-1; f=lift(factor(polcyclo(n)*Mod(1,3))); f=f[,1];
  for(k=1, #f, v=subst(f[k],x,10); if(isprime(v), print(v)));
  n:  s=Yn(2)
  2:  3
  3:  7
  4:  5
  5:  31
  6:  3
  7:  127
  8:  17
  9:  73
 10:  11
 11:  23 * 89   <--= SPP [11]
 12:  13
 13:  8191
 14:  43
 15:  151
 16:  257
 17:  131071
 18:  3 * 19
 19:  524287
 20:  5 * 41
 21:  7 * 337
 22:  683
 23:  47 * 178481
 24:  241
 25:  601 * 1801   <--= SPP [29]
 26:  2731
 27:  262657
 28:  29 * 113
 29:  233 * 1103 * 2089
 30:  331
 31:  2147483647
 32:  65537
 33:  599479
 34:  43691
 35:  71 * 122921
 36:  37 * 109   <--= SPP [17, 19, 23]
 37:  223 * 616318177
 38:  174763
 39:  79 * 121369
 40:  61681
 41:  13367 * 164511353
 42:  5419
 43:  431 * 9719 * 2099863
 44:  397 * 2113
 45:  631 * 23311   <--= SPP [5]
 46:  2796203
 47:  2351 * 4513 * 13264529
 48:  97 * 673
 49:  4432676798593
 50:  251 * 4051
 51:  103 * 2143 * 11119
 52:  53 * 157 * 1613
 53:  6361 * 69431 * 20394401
 54:  3 * 87211
 55:  881 * 3191 * 201961
 56:  15790321
 57:  32377 * 1212847
 58:  59 * 3033169
 59:  179951 * 3203431780337
 60:  61 * 1321
 61:  2305843009213693951
 62:  715827883
 63:  92737 * 649657
 64:  641 * 6700417
 65:  145295143558111

  n:  s=Yn(3)
 15:  4561
 16:  2 * 17 * 193
 17:  1871 * 34511
 18:  19 * 37   <--= SPP [7]
 19:  1597 * 363889
 20:  5 * 1181
 21:  368089
 22:  67 * 661
 23:  47 * 1001523179
 24:  6481
 25:  8951 * 391151
 26:  398581
 27:  109 * 433 * 8209
 28:  29 * 16493
 29:  59 * 28537 * 20381027
 30:  31 * 271   <--= SPP [29]
 31:  683 * 102673 * 4404047
 32:  2 * 21523361
 33:  2413941289
 34:  103 * 307 * 1021
 35:  71 * 2664097031
 36:  530713
 37:  13097927 * 17189128703
 38:  2851 * 101917
 39:  13 * 313 * 6553 * 7333
 40:  42521761
 41:  83 * 2526913 * 86950696619
 42:  7 * 43 * 2269
 43:  431 * 380808546861411923
 44:  5501 * 570461
 45:  181 * 1621 * 927001
 46:  23535794707
 47:  1223 * 21997 * 5112661 * 96656723
 48:  97 * 577 * 769
 49:  491 * 4019 * 8233 * 51157 * 131713
 50:  151 * 22996651
 51:  12853 * 99810171997
 52:  53 * 4795973261
 53:  107 * 24169 * 3747607031112307667
 54:  19441 * 19927
 55:  11 * 1321 * 560099669384411
 56:  430697 * 647753
 57:  229 * 248749 * 1824179209
 58:  523 * 6091 * 5385997
 59:  14425532687 * 489769993189671059
 60:  47763361
 61:  603901 * 105293313660391861035901
 62:  6883 * 22434744889
 63:  144542918285300809
 64:  2 * 926510094425921
 65:  131 * 3701101 * 110133112994711

Figure 39.11-E: Evaluations s of the first cyclotomic polynomials at 2 (left). Entries at prime n
are Mersenne numbers $M_n$, entries at $n = 2^k$ are Fermat numbers $F_{k-1}$. Composites that are strong
pseudoprimes to prime bases other than 2 are marked with 'SPP'. The right side shows the corresponding
data for evaluations at 3.
39.11.4.8 Further reading


Excellent introductions into topics related to prime numbers and methods of factorization are [284],
[361], [283], and [118]. Primality tests and factorization algorithms are also described in [154]. Some of
the newer factorization algorithms can be found in [110], readable surveys are [230] and [253]. Tables
of factorizations of numbers of the form $b^n \pm 1$ are given in [89] which also contains much historical
information.


A deterministic polynomial-time algorithm for proving primality was published by Agrawal, Kayal and 
Saxena in August 2002 [4]. While this is a major breakthrough in mathematics it does not render the 
Rabin-Miller test worthless. Indeed, ‘industrial grade’ primes are still produced with it, see [SI] (but 
see [270] for ‘counter examples’). Good introductions into the ideas behind the AKS algorithm and its 
improvements are and p.200ff]. 


39.12 Complex modulus: the field GF(p^2)


With real numbers the equation $a^2 = -1$ has no solution: there is no real square root of $-1$. The
construction of complex numbers proceeds by taking pairs of real numbers $(a, b) = a + i\,b$ together
with component-wise addition $(a,b) + (c,d) = (a+c,\, b+d)$ and multiplication defined by
$(a,b)\,(c,d) = (ac - bd,\, ad + bc)$. Indeed the pairs of real numbers together with addition and multiplication as given
constitute a field.


We will now rephrase the construction in a way that shows how to construct an extension field from a 
given ground field (or base field). In the example above the real numbers are the ground field and the 
complex numbers are the extension field. 


39.12.1 The construction of complex numbers 


There is no real square root of $-1$, that is, the polynomial $x^2 + 1$ has no real root. The construction
of the complex numbers proceeds by taking numbers of the form $a + b\,i$ where i is boldly defined to be
a root of the polynomial $x^2 + 1$. Now observe that if we identify $a + b\,i = b\,i + a$ with the polynomial
$b\,x + a$ and use polynomial addition and multiplication modulo the polynomial $x^2 + 1$, then we obtain
the arithmetic of complex numbers. Addition is component-wise, no modular reduction occurs. Now we
determine the multiplication rule:

$$(b\,x+a)\,(d\,x+c) = (bd)\,x^2 + (ad+bc)\,x + (ac) \tag{39.12-1a}$$
$$\equiv (ad+bc)\,x + (ac-bd) \pmod{x^2+1} \tag{39.12-1b}$$

We used the relation $x^2 \equiv -1$, so $u\,x^2 \equiv -u \pmod{x^2+1}$. Identify x with i in the relations to see that
the complex arithmetic is the polynomial arithmetic of real polynomials modulo the polynomial $x^2 + 1$.


If the ground field is the real numbers, the story comes to an end: every polynomial of arbitrary degree n 
with complex coefficients has exactly n complex roots (including multiplicity). That is, we cannot use 
the given construction to extend the field C: all roots of every polynomial p(x) with coefficients in C lie 
in C. The field C is algebraically closed. 


If we choose the ground field to be $F_p = GF(p)$, the integers modulo a prime p, and an irreducible
polynomial c(x) of degree n whose coefficients are in $F_p$, then we obtain an extension field $F_{p^n} = GF(p^n)$,
a finite field with $p^n$ elements. The special case of the binary finite fields $GF(2^n)$ is treated in chapter 42.


39.12.2 Complex finite fields 


With primes of the form p = 4k + 3 it is possible to construct a field of complex numbers as $-1$ is a
quadratic non-residue and so the polynomial $x^2 + 1$ is irreducible. The field is denoted by $GF(p^2)$.

 a   = 1+1*i
 a^1 = 1+1*i
 a^2 = 0+2*i  // (1+x)*(1+x)   = x^2+2*x+1   == 2*x+1 - 1 = 2*x+0 == 0+2*x
 a^3 = 1+2*i  // (1+x)*(0+2*x) = 2*x^2+2*x   == 2*x - 2   = 2*x-2 == 1+2*x
 a^4 = 2+0*i  // (1+x)*(1+2*x) = 2*x^2+3*x+1 == 3*x+1 - 2 = 3*x-1 == 2+0*x
 a^5 = 2+2*i
 a^6 = 0+1*i  // mod(x^2+1) mod(3)
 a^7 = 2+1*i == a^(-1) = 2+1*i
 a^8 = 1+0*i == one
 a^9 = 1+1*i
 R=maxord==8 == Mat([2, 3])
 r=ord(a)==8
 R/r=1

Figure 39.12-A: The powers of the element 1 + 1x modulo $x^2 + 1$ and p = 3.


The rules for complex addition, subtraction and multiplication are the 'usual' ones. The field has $p^2$
elements of which $R = p^2 - 1$ are invertible. The maximal order equals R, so the inverse of an element a
can be computed as $a^{-1} = a^{R-1} = a^{p^2-2}$.


For example, the powers of $a = 1 + x = 1 + i$ modulo $c = x^2 + 1$ and p = 3 are shown in figure 39.12-A.
Note that the modular reduction happens with both the polynomial $x^2 + 1$ and the prime p = 3. The
polynomial reduction uses $x^2 = -1$.
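The arithmetic just described can be sketched in a few lines of C++. This is an illustration only: the type `gfp2` and the functions `mul`, `power`, `inverse` and `order` are hypothetical names, and the code assumes a prime p = 4k + 3 below $2^{32}$ with all coefficients already reduced to the range 0..p-1.

```cpp
#include <cassert>
#include <cstdint>

// Element u + i*v of GF(p^2) with field polynomial x^2+1 (needs p = 4k+3).
struct gfp2 { uint64_t u, v; };

// (u1 + i v1)(u2 + i v2) = (u1 u2 - v1 v2) + i (u1 v2 + v1 u2), reduced modulo p.
gfp2 mul(gfp2 a, gfp2 b, uint64_t p)
{
    assert( p < (1ULL << 32) );  // products of reduced values fit into 64 bits
    uint64_t re = ( a.u * b.u % p  +  p  -  a.v * b.v % p ) % p;
    uint64_t im = ( a.u * b.v % p  +  a.v * b.u % p ) % p;
    return gfp2{ re, im };
}

// Binary powering: a^e in GF(p^2).
gfp2 power(gfp2 a, uint64_t e, uint64_t p)
{
    gfp2 r{ 1 % p, 0 };
    while ( e )
    {
        if ( e & 1 )  r = mul(r, a, p);
        a = mul(a, a, p);
        e >>= 1;
    }
    return r;
}

// Inverse of a nonzero element as a^(p^2-2).
gfp2 inverse(gfp2 a, uint64_t p)
{
    return power(a, p*p - 2, p);
}

// Order of a nonzero element by brute force (fine for tiny p only).
uint64_t order(gfp2 a, uint64_t p)
{
    assert( a.u != 0 || a.v != 0 );
    gfp2 t = a;
    uint64_t k = 1;
    while ( !(t.u == 1 && t.v == 0) )  { t = mul(t, a, p);  ++k; }
    return k;
}
```

With p = 3 and a = 1 + i this reproduces the data of the figure: the order is 8 and the inverse is $a^7 = 2 + i$.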


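 a    = 1+3*x
 a^1  = 1+3*x
 a^2  = 2+2*x  // (1+3*x)*(1+3*x) = 9*x^2+6*x+1 == 6*x+1 - (9*x+9) = -3*x-8 == 2+2*x
 a^3  = 1+2*x  // (1+3*x)*(2+2*x) = 6*x^2+8*x+2 == 8*x+2 - (6*x+6) = 2*x-4 == 1+2*x
 a^4  = 0+4*x  // (1+3*x)*(1+2*x) = 6*x^2+5*x+1 == 5*x+1 - (6*x+6) = -x-5 == 0+4*x
 a^5  = 3+2*x
 a^6  = 2+0*x  // mod(x^2+x+1) mod(5)
 a^7  = 2+1*x
 a^8  = 4+4*x
 a^9  = 2+4*x
 a^10 = 0+3*x
 a^11 = 1+4*x
 a^12 = 4+0*x
 a^13 = 4+2*x
 a^14 = 3+3*x
 a^15 = 4+3*x
 a^16 = 0+1*x
 a^17 = 2+3*x
 a^18 = 3+0*x
 a^19 = 3+4*x
 a^20 = 1+1*x
 a^21 = 3+1*x
 a^22 = 0+2*x
 a^23 = 4+1*x
 a^24 = 1+0*x

Figure 39.12-B: The powers of the element 1 + 3x modulo $x^2 + x + 1$ and p = 5.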


With primes of the form p = 4k + 1 it is also possible to construct a field $GF(p^2)$. But we have to use a
different polynomial as $x^2 + 1$ is reducible modulo p and thereby the multiplication rule is different. For
example, with p = 5 we find that $x^2 + x + 1$ is irreducible:

? p=5; m=Mod(1,p)*(1+x+x^2); polisirreducible(m)
1
? a=Mod(1,p)*(1+3*x)
? for(k=1,p^2-1,print("a^",k," = ",lift(Mod(a,m)^k)))
a^1 = Mod(3, 5)*x + Mod(1, 5)
a^2 = Mod(2, 5)*x + Mod(2, 5)
a^3 = Mod(2, 5)*x + Mod(1, 5)
[--snip--]

The complete list of powers is shown in figure 39.12-B. We see that a = 1 + 3x has the maximal order
(24), it is a primitive root. The polynomial reduction uses the relation $x^2 = -(x + 1)$.


The values of the powers of the primitive root can be used to 'randomly' fill a p × p array. With $a^k = u + x\,v$
we mark the entry at row v, column u with k:

  [ 4 11  9 19  8]
  [10  1 17 14 15]
  [22  3  2  5 13]
  [16 20  7 21 23]
  [-- 24  6 18 12]

The position 0,0 (lower left) is not visited. Note that row zero is the lowest row.

As described, the procedure fills a p × p array where p is a prime. With an irreducible polynomial of
degree n we can fill a $p^e \times p^f \times p^g \times \ldots \times p^k$ array if $e + f + g + \ldots + k = n$: For exponents equal
to 1 choose an arbitrary polynomial coefficient. For exponents h > 1, combine h polynomial coefficients
$c_0, c_1, \ldots, c_{h-1}$ as $z_h = c_0 + c_1\,p + c_2\,p^2 + \ldots + c_{h-1}\,p^{h-1}$.


39.12.3 Efficient reduction modulo certain quadratic polynomials 


The polynomial $C = x^2 + 1$ is irreducible for primes of the form 4k + 3 ($-1$ is not a square). We have

$$(a\,x+b)\,(A\,x+B) = (aA)\,x^2 + (aB + bA)\,x + (bB) \tag{39.12-2a}$$
$$\equiv (aB + bA)\,x + (-aA + bB) \bmod x^2+1 \tag{39.12-2b}$$
$$= ((a+b)\,(A+B) - aA - bB)\,x + (-aA + bB) \tag{39.12-2c}$$

The last equality shows how to multiply two complex numbers at the cost of three real multiplications
and five real additions instead of four multiplications and two additions.
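The three-multiplication rule can be compared against the schoolbook rule by brute force over a small field. The sketch below is only an illustration under stated assumptions: `lin`, `mul4`, `mul3` and `agree` are hypothetical names, a value `lin{a, b}` stands for the polynomial a x + b, and p is assumed to be below $2^{31}$ with reduced coefficients.

```cpp
#include <cassert>
#include <cstdint>

struct lin { uint64_t a, b; };  // the polynomial a*x + b, coefficients in [0, p)

// Schoolbook product modulo x^2+1 and p: four multiplications.
lin mul4(lin f, lin g, uint64_t p)
{
    assert( p < (1ULL << 31) );
    uint64_t x1 = ( f.a * g.b + f.b * g.a ) % p;           // a B + b A
    uint64_t x0 = ( f.b * g.b + p - f.a * g.a % p ) % p;   // b B - a A
    return lin{ x1, x0 };
}

// Same product with three multiplications (last line of the relation above).
lin mul3(lin f, lin g, uint64_t p)
{
    assert( p < (1ULL << 31) );
    uint64_t aA = f.a * g.a % p;
    uint64_t bB = f.b * g.b % p;
    uint64_t s  = (f.a + f.b) * (g.a + g.b) % p;   // (a+b)(A+B)
    uint64_t x1 = ( s + 2*p - aA - bB ) % p;       // a B + b A
    uint64_t x0 = ( bB + p - aA ) % p;             // b B - a A
    return lin{ x1, x0 };
}

// Exhaustive comparison of both rules for all pairs of elements.
bool agree(uint64_t p)
{
    for (uint64_t a = 0; a < p; ++a)
        for (uint64_t b = 0; b < p; ++b)
            for (uint64_t A = 0; A < p; ++A)
                for (uint64_t B = 0; B < p; ++B)
                {
                    lin r4 = mul4(lin{a, b}, lin{A, B}, p);
                    lin r3 = mul3(lin{a, b}, lin{A, B}, p);
                    if ( r4.a != r3.a || r4.b != r3.b )  return false;
                }
    return true;
}
```

Whether the saved multiplication pays off depends on the relative cost of modular multiplication and addition on the target machine.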


The polynomial $C = x^2 + d$ is irreducible if $-d$ is not a square. We have

$$(a\,x+b)\,(A\,x+B) \equiv (aB + bA)\,x + (-d\,aA + bB) \bmod x^2+d \tag{39.12-3a}$$
$$= ((a+b)\,(A+B) - aA - bB)\,x + (-d\,aA + bB) \tag{39.12-3b}$$

If the multiplication by d is cheap (for example, if d = 2) the implied technique can be a gain.

The polynomial $C := x^2 + x + 1$ has the roots $(-1 \pm \sqrt{-3})/2$ so it is irreducible modulo p if $-3$ is not a
square modulo p. The first few such primes p are

2 5 11 17 23 29 41 47 53 59 71 83 89 101 107 113 131 137 149 167 173 179 191 197 227 233 239

Multiplication modulo C costs only three scalar multiplications:

$$(a\,x+b)\,(A\,x+B) = (aA)\,x^2 + (aB + bA)\,x + (bB) \tag{39.12-4a}$$
$$\equiv (-aA + aB + bA)\,x + (-aA + bB) \bmod x^2+x+1 \tag{39.12-4b}$$
$$= ((a-b)\,(B-A) + bB)\,x + (-aA + bB) \tag{39.12-4c}$$

For the polynomial $C = x^2 + x + d$ use

$$(a\,x+b)\,(A\,x+B) \equiv (-aA + aB + bA)\,x + (-d\,aA + bB) \bmod x^2+x+d \tag{39.12-5a}$$
$$= ((a-b)\,(B-A) + bB)\,x + (-d\,aA + bB) \tag{39.12-5b}$$


The polynomial $C = x^2 - x - 1$ has the roots $(1 \pm \sqrt{5})/2$, so it is irreducible modulo p if 5 is not a square
modulo p. The first few such primes are:

2 3 7 13 17 23 37 43 47 53 67 73 83 97 103 107 113 127 137 157 163 167 173 193 197 223 227 233

Again, multiplication modulo C costs only three scalar multiplications:

$$(a\,x+b)\,(A\,x+B) = (aA)\,x^2 + (aB + bA)\,x + (bB) \tag{39.12-6a}$$
$$\equiv (aA + aB + bA)\,x + (aA + bB) \bmod x^2-x-1 \tag{39.12-6b}$$
$$= ((a+b)\,(A+B) - bB)\,x + (aA + bB) \tag{39.12-6c}$$

With the polynomial $C = x^2 - x - d$ use

$$(a\,x+b)\,(A\,x+B) \equiv (aA + aB + bA)\,x + (d\,aA + bB) \bmod x^2-x-d \tag{39.12-7a}$$
$$= ((a+b)\,(A+B) - bB)\,x + (d\,aA + bB) \tag{39.12-7b}$$

For polynomials of the form $C = x^2 - e\,x - d$ we have

$$(a\,x+b)\,(A\,x+B) = (aA)\,x^2 + (aB + bA)\,x + (bB) \tag{39.12-8a}$$
$$\equiv (e\,aA + aB + bA)\,x + (d\,aA + bB) \bmod x^2-e\,x-d \tag{39.12-8b}$$
$$= ((a+b)\,(A+B) + [e-1]\,aA - bB)\,x + (d\,aA + bB) \tag{39.12-8c}$$

If the multiplications by $e - 1$ and d are cheap, then the last equality can be useful. For example, with
the polynomial $C = x^2 - 2x - 1$ use

$$(a\,x+b)\,(A\,x+B) \equiv (2\,aA + aB + bA)\,x + (aA + bB) \bmod x^2-2x-1 \tag{39.12-9a}$$
$$= ((a+b)\,(A+B) - bB + aA)\,x + (aA + bB) \tag{39.12-9b}$$

With $C = x^2 - 3x + 1$ use

$$(a\,x+b)\,(A\,x+B) \equiv (3\,aA + aB + bA)\,x + (-aA + bB) \bmod x^2-3x+1 \tag{39.12-10a}$$
$$= ((a+b)\,(A+B) - bB + 2\,aA)\,x + (-aA + bB) \tag{39.12-10b}$$


39.12.4 An algorithm for primitive 2^j-th roots


For primes with the lowest k bits set ($p \equiv -1 \pmod{2^k}$) the largest power of 2 dividing the maximal
order in $GF(p^2)$ equals $2^{k+1}$: write $p = j\,2^k - 1$ with j odd, so $p + 1 = j\,2^k$ and $p - 1 = j\,2^k - 2 =
2\,(j\,2^{k-1} - 1)$, thereby $p^2 - 1 = 2^{k+1}\,[j\,(j\,2^{k-1} - 1)]$ where the term in square brackets is odd.

An algorithm for the construction of primitive $2^j$-th roots in $GF(p^2)$ for j = 2, 3, ..., a where $2^a$ is the
largest power of 2 dividing $p^2 - 1$ is given in [149]:

Let $u_2 := 0$ and for j > 2 define

$$u_j := \begin{cases} \left((u_{j-1} + 1)/2\right)^{(p+1)/4} & \text{if } j < a \\ \left((u_{j-1} - 1)/2\right)^{(p+1)/4} & \text{if } j = a \end{cases} \tag{39.12-11}$$

and (for j = 2, 3, ..., a)

$$v_j := \begin{cases} \left(+1 - u_j^2\right)^{(p+1)/4} & \text{if } j < a \\ \left(-1 - u_j^2\right)^{(p+1)/4} & \text{if } j = a \end{cases} \tag{39.12-12}$$

where all operations are modulo p. Then $u_j + i\,v_j$ is a primitive $2^j$-th root of unity in $GF(p^2)$.


For example, with p = 127 (and field polynomial $x^2 + 1$) we compute

 j:       uj      vj
 2:  ord(  0 + i* 1) = 4
 3:  ord(  8 + i* 8) = 8
 4:  ord(103 + i*21) = 16
 5:  ord( 68 + i*87) = 32
 6:  ord( 15 + i*41) = 64
 7:  ord( 32 + i*82) = 128
 8:  ord( 98 + i*38) = 256
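The recursion can be followed step by step in code. The sketch below is an illustration, not the implementation from [149] or from FXT: `powmod` and `roots2pow` are hypothetical names, p is assumed to be a prime of the form 4k + 3 below $2^{32}$, and the returned vector holds the pairs $(u_j, v_j)$ for j = 2, 3, ..., a.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// a^e modulo p for p < 2^32 (products fit into 64 bits).
uint64_t powmod(uint64_t a, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    a %= p;
    while ( e )
    {
        if ( e & 1 )  r = r * a % p;
        a = a * a % p;
        e >>= 1;
    }
    return r;
}

// Pairs (u_j, v_j), j = 2..a, such that u_j + i*v_j has order 2^j in GF(p^2).
std::vector< std::pair<uint64_t, uint64_t> > roots2pow(uint64_t p)
{
    assert( p % 4 == 3 && p < (1ULL << 32) );
    uint64_t a = 0;  // 2^a is the largest power of 2 dividing p^2-1
    for (uint64_t m = p*p - 1;  m % 2 == 0;  m /= 2)  ++a;
    const uint64_t half = (p + 1) / 2;  // 1/2 modulo p
    const uint64_t q    = (p + 1) / 4;  // exponent giving a square root

    std::vector< std::pair<uint64_t, uint64_t> > r;
    uint64_t u = 0, v = 1;              // j = 2: the element i
    r.push_back( std::make_pair(u, v) );
    for (uint64_t j = 3; j <= a; ++j)
    {
        const bool last = ( j == a );
        uint64_t w = last ? (u + p - 1) % p : (u + 1) % p;  // u_{j-1} - 1 or u_{j-1} + 1
        u = powmod( w * half % p, q, p );
        uint64_t uu = u * u % p;
        uint64_t t = last ? (2*p - 1 - uu) % p : (p + 1 - uu) % p;  // -1 - u^2 or 1 - u^2
        v = powmod( t, q, p );
        r.push_back( std::make_pair(u, v) );
    }
    return r;
}
```

For p = 127 the call `roots2pow(127)` should give seven pairs, ending with (98, 38) as in the table.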


For Mersenne primes $p = 2^e - 1$ one has $p^2 - 1 = (p+1)\,(p-1) = 2^e\,(2^e - 2) = 2^{e+1}\,(2^{e-1} - 1) =: 2^a\,k$
where k is odd. The highest power of 2 for which a primitive root exists is $2^a$ where a = e + 1. This
checks with our example where $p = 127 = 2^7 - 1$.


39.12.5 Primitive 2^j-th roots with Mersenne primes


For Mersenne primes $p = 2^e - 1$ an element of order $2^{e+1}$ (in $GF(p^2)$ with field polynomial $x^2 + 1$) can be
constructed more directly: first compute $\sqrt{-3} = 3^{(p+1)/4} = 3^{2^{e-2}}$ by squaring e − 2 times, then compute
$1/\sqrt{2} = 2^{(e-1)/2}$ which does not require modular reduction. Then an element of order $2^{e+1}$ is

$$z := \frac{1}{\sqrt{2}}\,\left(1 + \sqrt{3}\right) = \frac{1}{\sqrt{2}}\,\left(1 + i\,\sqrt{-3}\right) \tag{39.12-13}$$

 p = 131071 = 2^17-1
 r3 = 43811   r3^2 = -3
  k:  h + i*w  =  z^(2^k)  =  h + i*r3*u
  0:  h + i*w  =    +256 + i* -56490  =  h + i*r3*   +256   ord(.) = 2^18
  1:  h + i*w  =      +2 + i* +43811  =  h + i*r3*     +1   ord(.) = 2^17
  2:  h + i*w  =      +7 + i* +44173  =  h + i*r3*     +4   ord(.) = 2^16
  3:  h + i*w  =     +97 + i* -36933  =  h + i*r3*    +56   ord(.) = 2^15
  4:  h + i*w  =  +18817 + i* +43903  =  h + i*r3* +10864   ord(.) = 2^14
  5:  h + i*w  =  -17636 + i* -35524  =  h + i*r3* +45327   ord(.) = 2^13
  6:  h + i*w  =   -5975 + i* -36232  =  h + i*r3* +30114   ord(.) = 2^12
  7:  h + i*w  =  -32446 + i* +44887  =  h + i*r3* +58666   ord(.) = 2^11
  8:  h + i*w  =  -38713 + i* -16371  =  h + i*r3*  +3123   ord(.) = 2^10
  9:  h + i*w  =  +61109 + i* -46595  =  h + i*r3* +24597   ord(.) = 2^9
 10:  h + i*w  =  +63110 + i* +25098  =  h + i*r3* -48310   ord(.) = 2^8
 11:  h + i*w  =  +35245 + i* +14561  =  h + i*r3*  -3138   ord(.) = 2^7
 12:  h + i*w  =  -30756 + i* -12111  =  h + i*r3* +50228   ord(.) = 2^6
 13:  h + i*w  =  -15743 + i* -35732  =  h + i*r3* -19124   ord(.) = 2^5
 14:  h + i*w  =  -26425 + i* -55712  =  h + i*r3*  -1910   ord(.) = 2^4
 15:  h + i*w  =    -256 + i*   +256  =  h + i*r3* +18830   ord(.) = 2^3
 16:  h + i*w  =       0 + i*     -1  =  h + i*r3* +58294   ord(.) = 2^2
 17:  h + i*w  =      -1 + i*      0  =  h + i*r3*      0   ord(.) = 2^1
 18:  h + i*w  =      +1 + i*      0  =  h + i*r3*      0   ord(.) = 2^0

Figure 39.12-C: Elements of order $2^j$ in $GF(p^2)$ where $p = 2^{17} - 1$ is a Mersenne prime.


The result is given in [280] (where a different element $z' = \sqrt{2} + \sqrt{3}$ of the same order is used; note that
$\sqrt{2} = 2^{2^{e-2}} = 2^{(e+1)/2}$). The number z' is sometimes called the Creutzburg-Tasche primitive root as the
construction is also described in [119, p.200]. We have $z^2 = 2 + \sqrt{3} = 2 + i\,\sqrt{-3} = H_1 + i\,\sqrt{-3}\,U_1$, and

$$z^{2^k} = H_{2^{k-1}} + i\,\sqrt{-3}\,U_{2^{k-1}} \quad \text{for } k \geq 1 \tag{39.12-14}$$

Figure 39.12-C shows the values of the successive $2^k$-th powers of z in $GF(p^2)$ where $p = 2^{17} - 1$.

The sequences H = 2, 7, 97, 18817, ... and U = 1, 4, 56, 10864, ... are those which appear in the Lucas-
Lehmer test (see section 39.11.4.1 on page 796). The order of $z^{2^k}$ is $2^{e+1-k}$. We have $H_{2k} = 2\,H_k^2 - 1$,
$H_k^2 - 3\,U_k^2 = 1$, and the index doubling formulas for the convergents of the continued fraction of $\sqrt{3}$:

$$H_{2k} = H_k^2 + 3\,U_k^2 \tag{39.12-15a}$$
$$U_{2k} = 2\,H_k\,U_k \tag{39.12-15b}$$

A method to compute a primitive root is described in [281]: Let c be a primitive root in GF(p), then
$a + b\,i$ is a primitive root in $GF(p^2)$ if $a^2 + b^2 = c \bmod p$ (this can always be solved for any c). For c = 3
a solution is given by $a = 2^{(e-1)/2} + 1$ and $b = 2^{(e-1)/2} - 1$.


39.12.6 Cosine and sine in GF(p?) 


 j:  elem. of order 2^j     cosine       sine
 0:  u+i*v=   1 + i* 0      cos=   1     i*sin=  0
 1:  u+i*v= 126 + i* 0      cos= 126     i*sin=  0
 2:  u+i*v=   0 + i* 1      cos=   0     i*sin=  1*i
 3:  u+i*v=   8 + i* 8      cos=   8     i*sin=  8*i
 4:  u+i*v= 103 + i*21      cos= 103     i*sin= 21*i
 5:  u+i*v=  68 + i*87      cos=  68     i*sin= 87*i
 6:  u+i*v=  15 + i*41      cos=  15     i*sin= 41*i
 7:  u+i*v=  32 + i*82      cos=  32     i*sin= 82*i
 8:  u+i*v=  98 + i*38      cos=  38*i   i*sin= 98

Figure 39.12-D: Elements of order $2^j$ and the corresponding sines and cosines in $GF(127^2)$.


Let z be an element of order n in $GF(p^2)$, we would like to identify z with $\exp(2\pi i/n)$ and determine
the values equivalent to $\cos(2\pi/n)$ and $\sin(2\pi/n)$. We set

$$\cos\frac{2\pi}{n} := \frac{z^2+1}{2\,z} \tag{39.12-16a}$$
$$i\,\sin\frac{2\pi}{n} := \frac{z^2-1}{2\,z} \tag{39.12-16b}$$

For this choice of sine and cosine the following relations hold:

$$\exp(x) = \cos(x) + i\,\sin(x) \tag{39.12-17a}$$
$$\sin(x)^2 + \cos(x)^2 = 1 \tag{39.12-17b}$$

The first relation is trivial: $\frac{z^2+1}{2z} + \frac{z^2-1}{2z} = z$. The second can be verified by writing i for some element
that is the square root of $-1$:

$$\left(\frac{z^2+1}{2\,z}\right)^2 + \left(\frac{z^2-1}{2\,z\,i}\right)^2 = \frac{(z^2+1)^2 - (z^2-1)^2}{4\,z^2} = 1$$

The quantities corresponding to the $2^j$-th roots in $GF(127^2)$ are shown in figure 39.12-D. Note how the
i swaps side with the element of highest order $2^a$.

The construction shows how to mechanically convert fast Fourier (and Hartley) transforms with explicit 
trigonometric constants into the corresponding number theoretic transforms. The idea of expressing 
cosines and sines in terms of primitive roots was taken from [332]. 


39.12.7 Cosine and sine in GF(p) 


What about primes of the form p = 4k + 1 that are used anyway for NTTs? The same construction
works. The polynomial $x^2 + 1$ is reducible modulo p = 4k + 1, so $-1$ is a quadratic residue and its square
root lies in GF(p). We could say: i is real modulo p if p is of the form 4k + 1.


 modulus= 257 == 0x101
 modulus is cyclic
 modulus is prime
 bits(modulus)= 8.0056245 == 9 - 0.99437545
 euler_phi(modulus)= 256 == 0x100 == 2^8
 maxorder= 256 == 0x100
 maxordelem= 3 == 0x3
 max2pow= 8 (max FFT length = 2**8 == 256)
 root2pow(max2pow)=3  root2pow(-max2pow)=86
 sqrt(-1) =: i = 241

  8:  z=   3  =  ( 173 +  87)  =  ( 173 + 107*i)
  7:  z=   9  =  ( 233 +  33)  =  ( 233 +  14*i)
  6:  z=  81  =  ( 123 + 215)  =  ( 123 +  99*i)
  5:  z= 136  =  ( 188 + 205)  =  ( 188 + 196*i)
  4:  z= 249  =  (  12 + 237)  =  (  12 + 194*i)
  3:  z=  64  =  (  30 +  34)  =  (  30 +  30*i)
  2:  z= 241  =  (   0 + 241)  =  (   0 +   1*i)
  1:  z= 256  =  ( 256 +   0)  =  ( 256 +   0*i)
  0:  z=   1  =  (   1 +   0)  =  (   1 +   0*i)
 -1:  z= 256  =  ( 256 +   0)  =  ( 256 +   0*i)
 -2:  z=  16  =  (   0 +  16)  =  (   0 + 256*i)
 -3:  z= 253  =  (  30 + 223)  =  (  30 + 227*i)
 -4:  z=  32  =  (  12 +  20)  =  (  12 +  63*i)
 -5:  z= 240  =  ( 188 +  52)  =  ( 188 +  61*i)
 -6:  z= 165  =  ( 123 +  42)  =  ( 123 + 158*i)
 -7:  z= 200  =  ( 233 + 224)  =  ( 233 + 243*i)
 -8:  z=  86  =  ( 173 + 170)  =  ( 173 + 150*i)

Figure 39.12-E: Roots of order $2^j$ modulo $p = 257 = 2^8 + 1$.


In the implementation [FXT: class mod in mod/mod.h] the cosine and sine values are computed from
the primitive roots of order $2^j$. The program [FXT: mod/modsincos-demo.cc] generates the list of $2^j$-th
roots and inverse roots shown in figure 39.12-E.


Again we can translate a routine for the fast Fourier (or Hartley) transform in a mechanical way. 
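As a small illustration of the definitions in GF(p), the following sketch computes the cosine and sine belonging to a root z from a given square root i of −1. It is not the code of the FXT class mod: `sincos_t`, `powmod`, `invmod` and `cos_sin` are hypothetical names, and a prime p below $2^{32}$ is assumed.

```cpp
#include <cassert>
#include <cstdint>

struct sincos_t { uint64_t c, s; };  // cosine and sine as elements of GF(p)

// a^e modulo p for p < 2^32.
uint64_t powmod(uint64_t a, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    a %= p;
    while ( e )
    {
        if ( e & 1 )  r = r * a % p;
        a = a * a % p;
        e >>= 1;
    }
    return r;
}

// Inverse modulo a prime p via Fermat's little theorem.
uint64_t invmod(uint64_t a, uint64_t p)  { return powmod(a, p - 2, p); }

// cos = (z^2+1)/(2z) and sin = (z^2-1)/(2 z i), where i^2 == -1 modulo p.
sincos_t cos_sin(uint64_t z, uint64_t i, uint64_t p)
{
    assert( p < (1ULL << 32) );
    assert( i * i % p == p - 1 );
    uint64_t zz = z * z % p;
    uint64_t c = (zz + 1) % p * invmod( 2 * z % p, p ) % p;
    uint64_t s = (zz + p - 1) % p * invmod( 2 * z % p * i % p, p ) % p;
    return sincos_t{ c, s };
}
```

With p = 257, i = 241 and z = 3 this gives the pair (173, 107) of the first row of the figure, and cos + i sin reproduces z.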
An element modulo a prime $p = k \cdot 2^t + 1$ whose order equals $2^t$ can be found by the following algorithm
even if the factorization of k is not known: Choose a random a where 1 < a < p − 1 and compute $s = a^k$,
if $-1 \equiv s^{2^{t-1}}$, then return s, else try another a.

The algorithm terminates when the first element a is encountered whose order has the factor $2^t$. An
implementation that tests a = 2, 3, ..., p − 2 sequentially is

el2(k, t)=
{
    local(p, s);
    p = k*2^t+1;
    for (a=2, p-2, s = Mod(a,p)^k; if ( Mod(-1,p)==s^(2^(t-1)), return( s ) ); );
}

With $p = 314151729239163 \cdot 2^{26} + 1$ the algorithm terminates after testing a = 5 (of order (p − 1)/3) and
returning s = 18583781386455525528042 whose order is indeed $2^{26}$.


In general, if $p = u \cdot f + 1$, gcd(f, u) = 1 and $f = \prod_i p_i^{e_i}$ is fully factored, then an element of order f can
be determined by testing random values a:

1. Take a random a and set $s = a^u$.
2. If $s^{f/p_i} \neq 1$ for all prime factors $p_i$ of f, then return s (an element of order f).
3. Go to step 1.
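The three steps can be sketched as follows. This is an illustration under stated assumptions: the names `mulmod`, `powmod` and `element_of_order` are made up, the values a = 2, 3, ... are tested sequentially instead of randomly (so that the result is reproducible), p = u f + 1 is assumed to be a prime that fits into 64 bits, and `primes` must list the distinct prime factors of f. The 128-bit type is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Product modulo m via a 128-bit intermediate (compiler extension).
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)( (unsigned __int128)a * b % m );
}

// a^e modulo m.
uint64_t powmod(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    while ( e )
    {
        if ( e & 1 )  r = mulmod(r, a, m);
        a = mulmod(a, a, m);
        e >>= 1;
    }
    return r;
}

// Element of order f modulo the prime p = u*f+1, where gcd(u,f) == 1.
uint64_t element_of_order(uint64_t u, uint64_t f, const std::vector<uint64_t> &primes)
{
    assert( !primes.empty() );
    const uint64_t p = u * f + 1;
    for (uint64_t a = 2; a < p; ++a)
    {
        const uint64_t s = powmod(a, u, p);   // the order of s divides f
        bool ok = true;
        for (uint64_t q : primes)             // s must not lie in a proper subgroup
            if ( powmod(s, f / q, p) == 1 )  { ok = false;  break; }
        if ( ok )  return s;
    }
    return 0;  // not reached when p is prime
}
```

For p = 257 = 1 · 256 + 1 the search skips a = 2 (whose order is only 16) and returns 3. For $p = 2281 = 285 \cdot 2^3 + 1$ it returns an element s of order 8, so $s^4 \equiv -1$.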


39.12.8 Decomposing p = 4k + 1 as sum of two squares


We give algorithms to decompose a prime p = 4k + 1 as a sum of two squares.

39.12.8.1 Direct computation


The direct way to determine u and v with $n = u^2 + v^2$ is to check, for $v = 0, 1, 2, \ldots, \lfloor\sqrt{n}\rfloor$, whether
$n - v^2$ is a perfect square. If so, return $u = \sqrt{n - v^2}$ and v:

sumofsquares_naive(n)=
{ /* return [u,v] such that u^2+v^2==n */
    local(w);
    for (v=0, sqrtint(n), \\ search until n-v^2 is a square
        w = n-v^2;
        if ( issquare(w), return( [sqrtint(w), v] ) );
    );
    return( 0 ); \\ not the sum of two squares
}

The routine needs at most $\lfloor\sqrt{n}\rfloor$ steps which renders it rather useless for n large. With the prime
$n = 314151729239163 \cdot 2^{26} + 1 \approx 2 \cdot 10^{22}$ we have $n = u^2 + v^2$ where u = 132599472793 and v = 59158646772
and the routine would need v steps to find the solution. The method described next finds the solution
immediately.


39.12.8.2 Computation using continued fractions

The square root i of $-1$ can be used to find the representation of a prime p = 4k + 1 as a sum of two
squares, $p = u^2 + v^2$, as follows:

1. Determine i where $i^2 = -1$ modulo p. If i > p/2, then set i = p − i.
2. Compute the continued fraction of p/i, it has the form $[a_0, a_1, \ldots, a_n, a_n, \ldots, a_1, a_0]$.
3. Compute the numerators of the (n − 1)-st and the n-th convergent, $P_{n-1}$ and $P_n$. Return $u = P_{n-1}$
   and $v = P_n$.

Assume that $p = k \cdot 2^t + 1$ where $t \geq 2$. Use an element of order $2^t$ to find a square root of $-1$:

imag4ki(k, t)=
{ /* determine s such that s^2==-1 modulo p=k*2^t+1 */
    local(s);
    s = el2(k, t);
    s = s^(2^(t-2));
    return( s );
}

Now the decomposition as a sum of two squares can be found with

sumofsquares(k, t)=
{ /* return [u,v] such that u^2+v^2==p =k*2^t+1 */
    local(i, s, p, cf, q, u, v);
    i = lift( imag4ki(k, t) );
    p = k*2^t+1;
    if ( i>=p/2, i = p-i );
    cf = contfrac(p/i);
    cf = vector(length(cf)/2, j, cf[j]); \\ keep the first half only
    q = contfracpnqn(cf);
    u = q[1, 1]; v = q[1, 2];
    return( [u, v] );
}

For example, the relevant quantities with p = 2281 are

i = 1571 \\ square root of -1
i = 710 \\ choose smaller square root
cf = [3, 4, 1, 2, 2, 1, 4, 3] \\ == contfrac(2281/710)
cf = [3, 4, 1, 2] \\ first half of contfrac
q = contfracpnqn(cf) =
[45 16] \\ == [P_4, P_3]
[14 5] \\ == [Q_4, Q_3] (unused)
u=45; v=16; \\ u^2 + v^2 = 2025 + 256 = 2281
39.12.8.3 A memory saving version


An algorithm that avoids storing the continued fraction comes from the observation that u and v appear
in the calculation of gcd(p, i). We use the routine

gcd_print(p, i)=
{
    local( t, s );
    if ( p<i, t=p; p=i; i=t; );
    s = sqrtint(p);
    while ( i,
        \\ mark the first pair where p^2 is less than the initial p
        if ( p<=s, print(" ", p, " ", i, " <--="); s = 0, print(" ", p, " ", i) );
        t = p%i; p = i; i = t;
    );
}

For p = 2281 (where i = 710) the following list is produced:

 2281 710
 710 151
 151 106
 106 45
 45 16 <--=
 16 13
 13 3
 3 1

The marked pair is the first where $u^2 < p$. The routine for the decomposition into two squares is


sumofsquares_gcd(k, t)=
{ /* return [u,v] such that u^2+v^2==p =k*2^t+1 */
    local(s, p, i, w);
    i = lift( imag4ki(k, t) );
    p = k*2^t+1;
    if ( i>=p/2, i = p-i );
    w = sqrtint(p);
    while ( i,
        if ( p<=w, return( [p,i] ) );
        t = p%i; p = i; i = t;
    );
    return( [0,0] ); \\ failure
}
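For readers who prefer the low-level view, the memory saving variant can be sketched in C++. The sketch assumes that a square root i of −1 modulo p is already known and that p is small enough for the integer square root helper; `isqrt` and `sum_of_squares` are hypothetical names, and the 128-bit type used in the precondition check is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Floor of the square root by simple search (adequate for small n only).
uint64_t isqrt(uint64_t n)
{
    uint64_t r = 0;
    while ( (r + 1) * (r + 1) <= n )  ++r;
    return r;
}

// Given a prime p = 4k+1 and i with i^2 == -1 (mod p),
// return (u, v) such that u^2 + v^2 == p.
std::pair<uint64_t, uint64_t> sum_of_squares(uint64_t p, uint64_t i)
{
    assert( (uint64_t)( (unsigned __int128)i * i % p ) == p - 1 );
    if ( i > p / 2 )  i = p - i;       // choose the smaller square root
    const uint64_t w = isqrt(p);
    uint64_t a = p, b = i;
    while ( b )                        // Euclidean algorithm on (p, i)
    {
        if ( a <= w )  return std::make_pair(a, b);  // first remainder below sqrt(p)
        uint64_t t = a % b;  a = b;  b = t;
    }
    return std::make_pair( (uint64_t)0, (uint64_t)0 );  // failure
}
```

With p = 2281 and i = 1571 the loop passes through the pairs listed above and stops at (45, 16).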


Using the relation $a^2 + b^2 = (a + i\,b)\,(a - i\,b)$ we can use the decomposition into two squares to compute the
factorization of a number over the complex integers. For example, we have $3141592653 = 3 \cdot 107 \cdot 9786893$
(over Z) where the greatest prime factor is of the form 4k + 1. For $9786893 = 2317^2 + 2102^2$ we find
$3141592653 = -i \cdot 3 \cdot 107 \cdot (2317 + 2102\,i) \cdot (2102 + 2317\,i)$. GP has a built-in routine for this task:

? factor( 3141592653 + 0*I )
[-I 1] [3 1] [107 1] [2317 + 2102*I 1] [2102 + 2317*I 1]

If a decomposition $n = x^2 + y^2$ of n is known, then the square roots of $-1$ can be computed as
$i = \pm x/y \bmod n$. For $n = x^2 + d\,y^2$ we have $\sqrt{-d} = x/y \bmod n$, and for $n = x^k + d\,y^k$ we have
$\sqrt[k]{-d} = x/y \bmod n$. For example, for $n = 2^k + 1$ we know that $\sqrt[k]{-1} = 2/1 = 2 \bmod n$, and for
$n = 2^7 + 3 \cdot 5^7 = 234503$ we have $\sqrt[7]{-3} = 2/5 = 46901 \bmod n$. For $n = a\,x^k + b\,y^k$ we have
$\sqrt[k]{-b/a} = x/y \bmod n$.


39.13 Solving the Pell equation


Simple continued fractions (see page 716) can be used to find integer solutions of the
equations

$$x^2 - d\,y^2 = +1 \tag{39.13-1a}$$
$$x^2 - d\,y^2 = -1 \tag{39.13-1b}$$

Equation 39.13-1a is usually called the Pell equation. The name Bhaskara equation (or Brahmagupta-Bhaskara
equation, used in [221]) has been suggested because Brahmagupta (ca. 600 AD) and Bhaskara
(ca. 1100 AD) were the first to study and solve this equation.


39.13.1 Solution via continued fractions 


The convergents $P_k/Q_k$ of the continued fraction of $\sqrt{d}$ are close to $\sqrt{d}$: $(P_k/Q_k)^2 \approx d$. If we define
$e_k := P_k^2 - d\,Q_k^2$, then solutions of relation 39.13-1a correspond to $e_k = +1$, solutions of 39.13-1b to
$e_k = -1$.

As an example we set d = 53. The continued fraction of $\sqrt{53}$ is

$$CF(\sqrt{53}) = [7,\ 3, 1, 1, 3, 14,\ 3, 1, 1, 3, 14,\ 3, 1, 1, 3, 14,\ \ldots] \tag{39.13-2a}$$
$$= [7,\ \overline{3, 1, 1, 3, 14}] \tag{39.13-2b}$$

We observe that the sequence is periodic after the initial term and the last term of the period is twice
the initial term. Moreover, disregarding the term 14, the terms in the period form a palindrome. These
properties actually hold for all simple continued fractions of square roots $\sqrt{d}$ with d not a perfect square,
for the proofs see [263] or [221]. For the computation of the continued fraction of a square root a
specialized version of the algorithm from section 37.3.1.2 on page 718 will be most efficient.

The table shown in figure 39.13-A gives the first convergents $P_k/Q_k$ together with $e_k := P_k^2 - 53\,Q_k^2$. The
entry for k = 4 corresponds to the smallest solution (x, y) of $x^2 - 53\,y^2 = -1$: $182^2 - 53 \cdot 25^2 = -1$. Entry
k = 9 corresponds to $66249^2 - 53 \cdot 9100^2 = +1$, the smallest nontrivial solution to $x^2 - 53\,y^2 = +1$ (the
trivial solution is $(P_{-1}, Q_{-1}) = (1, 0)$).


The continued fraction for $\sqrt{19}$ is

$$CF(\sqrt{19}) = [4,\ 2, 1, 3, 1, 2, 8,\ 2, 1, 3, 1, 2, 8,\ 2, 1, 3, 1, 2, 8,\ \ldots] \tag{39.13-3a}$$
$$= [4,\ \overline{2, 1, 3, 1, 2, 8}] \tag{39.13-3b}$$

Its period is l = 6. Figure 39.13-B shows the corresponding table, it contains solutions with $e_k = +1$ but
none with $e_k = -1$.
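The search through the convergents can be sketched as follows. The sketch uses the standard recurrence for the continued fraction of a square root, which is an assumption here and not necessarily the specialized algorithm referred to above; `isqrt` and `pell` are hypothetical names, d must be a small non-square, and the 128-bit type is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Floor of the square root by simple search (adequate for small n only).
uint64_t isqrt(uint64_t n)
{
    uint64_t r = 0;
    while ( (r + 1) * (r + 1) <= n )  ++r;
    return r;
}

// Smallest nontrivial solution of x^2 - d*y^2 == want, where want is +1 or -1.
// Returns false if want == -1 and no solution exists.
bool pell(uint64_t d, int want, uint64_t &x, uint64_t &y)
{
    const uint64_t a0 = isqrt(d);
    assert( a0 * a0 != d );          // d must not be a perfect square
    uint64_t m = 0, q = 1, a = a0;   // state of the continued fraction expansion
    uint64_t P1 = 1, Q1 = 0;         // convergent with index k-1 (initially k-1 == -1)
    uint64_t P = a0, Q = 1;          // convergent with index k (initially k == 0)
    while ( true )
    {
        __int128 e = (__int128)P * P - (__int128)d * Q * Q;   // e_k
        if ( e == want )  { x = P;  y = Q;  return true; }
        if ( e == 1 )  return false; // first period gave +1, so -1 is impossible
        m = q * a - m;               // next partial quotient a
        q = (d - m * m) / q;
        a = (a0 + m) / q;
        uint64_t Pn = a * P + P1,  Qn = a * Q + Q1;   // next convergent
        P1 = P;  Q1 = Q;  P = Pn;  Q = Qn;
    }
}
```

For d = 53 the routine should return (182, 25) for the right side −1 and (66249, 9100) for +1. For d = 19 it finds (170, 39) and reports that −1 has no solution.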


Let e correspond to the minimal nontrivial solution of $x^2 - d\,y^2 = \pm 1$. If e = +1, then no solution for
$x^2 - d\,y^2 = -1$ exists. Nontrivial solutions with e = +1 always exist, solutions with e = −1 only exist
when the period l of the continued fraction of $\sqrt{d}$ is odd. The period is always odd for primes of the
form p = 4k + 1 and never for numbers of the form n = 4k + 3 or 4k. If any factor $f_i$ of d is of the form
$f_i = 4k + 3$, then no solution with e = −1 exists, because this would imply $x^2 \equiv -1 \bmod f_i$ but $-1$ is
never a quadratic residue modulo $f_i = 4k + 3$ by relation 39.8-3a on page 782.

   k:   a_k        P_k        Q_k    e_k := P_k^2 - d*Q_k^2
  -1:    --          1          0     +1
   0:     7          7          1     -4
   1:     3         22          3     +7
   2:     1         29          4     -7
   3:     1         51          7     +4
   4:     3        182         25     -1
   5:    14       2599        357     +4
   6:     3       7979       1096     -7
   7:     1      10578       1453     +7
   8:     1      18557       2549     -4
   9:     3      66249       9100     +1
  10:    14     946043     129949     -4
  11:     3    2904378     398947     +7
  12:     1    3850421     528896     -7
  13:     1    6754799     927843     +4
  14:     3   24114818    3312425     -1
  15:    14  344362251   47301793     +4

Figure 39.13-A: The first convergents $P_k/Q_k$ of the continued fraction of $\sqrt{53}$.

   k:   a_k      P_k      Q_k    e_k := P_k^2 - d*Q_k^2
  -1:    --        1        0     +1
   0:     4        4        1     -3
   1:     2        9        2     +5
   2:     1       13        3     -2
   3:     3       48       11     +5
   4:     1       61       14     -3
   5:     2      170       39     +1
   6:     8     1421      326     -3
   7:     2     3012      691     +5
   8:     1     4433     1017     -2
   9:     3    16311     3742     +5
  10:     1    20744     4759     -3
  11:     2    57799    13260     +1
  12:     8   483136   110839     -3

Figure 39.13-B: The first convergents $P_k/Q_k$ of the continued fraction of $\sqrt{19}$.
However, all prime factors being of the form 4k + 1 does not guarantee that e = −1, the smallest examples
are $205 = 5 \cdot 41$, $221 = 13 \cdot 17$, $305 = 5 \cdot 61$, and $377 = 13 \cdot 29$. The list of such numbers up to 2500 is


205, 221, 305, 377, 505, 545, 689, 725, 745, 793, 905, 1205, 1345, 1405, 1469, 1513, 
1517, 1537, 1717, 1885, 1945, 1961, 2005, 2041, 2045, 2105, 2225, 2245, 2329, 2353 


The sequence of numbers d with no factor of the form 4k + 3 such that x? — dy? = —1 has no solution 
is entry A031399 in [312]: 


4, 8, 16, 20, 25, 32, 34, 40, 52, 64, 68, 80, 
100, 104, 116, 128, 136, 146, 148, 160, 164, 169, 178, 194, 
200, 205, 208, 212, 221, 232, 244, 256, 260, 272, 289, 292, 296, ... 


An algorithm for computing solutions (x, y) of the equation $A\,x^2 - B\,y^2 = N$ is given in [246].
39.13.2 Multiplying and powering solutions


Consider two solutions (x, y) and (r, s) of the Pell equation

$$x^2 - d\,y^2 = e \tag{39.13-4a}$$
$$r^2 - d\,s^2 = f \tag{39.13-4b}$$

where $e = \pm 1$ and $f = \pm 1$. Now write

$$x^2 - d\,y^2 = (x + \sqrt{d}\,y)\,(x - \sqrt{d}\,y) \tag{39.13-5}$$

and the same for (r, s). We compute the products

$$(x + \sqrt{d}\,y)\,(r + \sqrt{d}\,s) = (x\,r + d\,y\,s) + \sqrt{d}\,(x\,s + y\,r) \tag{39.13-6a}$$
$$(x - \sqrt{d}\,y)\,(r - \sqrt{d}\,s) = (x\,r + d\,y\,s) - \sqrt{d}\,(x\,s + y\,r) \tag{39.13-6b}$$

By multiplying both relations we see that $(U, V) := (x\,r + d\,y\,s,\ x\,s + y\,r)$ is also a solution:

$$U^2 - d\,V^2 = e\,f \tag{39.13-7}$$

Now let (r, s) be the smallest nontrivial solution and define $(x_k, y_k)$ by

$$x_k + \sqrt{d}\,y_k = (r + \sqrt{d}\,s)^k \tag{39.13-8}$$

Then $(x_k, y_k)$ is the k-th solution of the Pell equation.

Let $r^2 - d\,s^2 = e$, then we have $x_k^2 - d\,y_k^2 = e^k$. Therefore, if $r^2 - d\,s^2 = +1$, then there is no solution
(x, y) such that $x^2 - d\,y^2 = -1$. If $r^2 - d\,s^2 = -1$, then $x_k^2 - d\,y_k^2 = -1$ for all odd k.

As we can multiply solutions, we can also raise them to any power. Let (x, y) be such that $x^2 - d\,y^2 = e$
where $e = \pm 1$ and define the matrix M by

$$M := \begin{bmatrix} x & d\,y \\ y & x \end{bmatrix} \tag{39.13-9}$$

The k-th power of the solution (x, y) is

$$\begin{bmatrix} X_k \\ Y_k \end{bmatrix} = M^k\,\begin{bmatrix} 1 \\ 0 \end{bmatrix} \tag{39.13-10}$$

Here $(X_k, Y_k)$ denotes the k-th power of (x, y). We have, for the squared solution,

$$X_2 = x^2 + d\,y^2 = 2\,x^2 - e = 2\,d\,y^2 + e \tag{39.13-11a}$$
$$Y_2 = 2\,x\,y \tag{39.13-11b}$$

And for the third power

$$X_3 = x\,(x^2 + 3\,d\,y^2) = x\,(4\,x^2 - 3\,e) \tag{39.13-12a}$$
$$Y_3 = y\,(3\,x^2 + d\,y^2) = y\,(4\,x^2 - e) = y\,(4\,d\,y^2 + 3\,e) \tag{39.13-12b}$$

Note that the last equality in the first relation expresses $X_3$ solely in terms of x, d, and e, and the last
equality in the second relation expresses $Y_3$ solely in terms of y, d, and e. Therefore $X_{3^k}$ and $Y_{3^k}$ can be
computed independently.
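Multiplying and powering solutions takes only a few lines. The sketch below uses plain 64-bit signed integers, so it is suitable for small examples only; `sol`, `mul`, `norm` and `power` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct sol { int64_t x, y; };  // stands for x + sqrt(d)*y

// Product (x + sqrt(d) y)(r + sqrt(d) s) = (x r + d y s) + sqrt(d) (x s + y r).
sol mul(sol a, sol b, int64_t d)
{
    return sol{ a.x * b.x + d * a.y * b.y,  a.x * b.y + a.y * b.x };
}

// The value x^2 - d*y^2.
int64_t norm(sol a, int64_t d)
{
    return a.x * a.x - d * a.y * a.y;
}

// k-th power by binary powering; the trivial solution (1, 0) is the unit.
sol power(sol a, unsigned k, int64_t d)
{
    sol r{ 1, 0 };
    while ( true )
    {
        if ( k & 1 )  r = mul(r, a, d);
        k >>= 1;
        if ( k == 0 )  break;
        a = mul(a, a, d);   // square only while bits remain
    }
    return r;
}
```

With d = 53 and the solution (182, 25) of the equation with right side −1, the square is (66249, 9100) with right side +1 and the cube is (24114818, 3312425) with right side −1. Both agree with the closed forms for the second and third power.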


Relations 39.13-11a and 39.13-11b are the numerator and denominator of the second order iteration
for $\sqrt{d}$, relation 29.2-2a. Relations 39.13-12a and 39.13-12b correspond to the third order
iteration, relation 29.2-2b.

If the pair (x, y) is a solution with e = +1, then $(T_n(x),\ y\,U_{n-1}(x))$ is also a solution, where $T_n$ and $U_n$
are the Chebyshev polynomials of the first and second kind:

$$T_n^2(x) - d\,(y\,U_{n-1}(x))^2 = T_n^2(x) - d\,y^2\,U_{n-1}^2(x) = \tag{39.13-13a}$$
$$T_n^2(x) - (x^2 - 1)\,U_{n-1}^2(x) = 1 \tag{39.13-13b}$$

The last equality is relation 35.2-25 on page 683. Similarly, if (x, y) is a solution with e = −1, then
$(T_n^{+}(x),\ y\,U_{n-1}^{+}(x))$ is also a solution if n is odd, by equation 35.2-29 on page 684. See [190] for much
more information about Pell's equation.


II 


39.14 Multiplication of hypercomplex numbers ‡


An n-dimensional vector space (over a field) together with component-wise addition and a multiplication 
table that defines the product of any two (vector) components defines an algebra. 


The product of two elements $x = \sum_k \alpha_k\,e_k$ and $y = \sum_j \beta_j\,e_j$ of the algebra is defined as

    $x \cdot y = \sum_{k,j=0}^{n-1} \left[ (\alpha_k \cdot \beta_j)\; m_{k,j} \right]$    (39.14-1)

The quantities $m_{k,j} = e_k\,e_j$ are given in the multiplication table of the algebra. These can be arbitrary
elements of the algebra, that is, linear combinations of the components $e_i$. For example, a 2-dimensional
algebra over the real numbers could have the following multiplication table:


          e0                   e1
    e0:   (5*e1 + 3*e0)        (239*e0 + 3.1415*e1)
    e1:   (0)                  (17*e1 + 2.71828*e0)

Note that there is no neutral element of multiplication ('one'). Further, the algebra has zero divisors:
the equation $x \cdot y = 0$ has a solution where neither element is zero, namely $x = e_1$ and $y = e_0$. Like almost
all randomly defined algebras, it is completely uninteresting.

In what follows we will only consider algebras over the real numbers where the product of two components
equals $\pm 1$ times another component. For example, the complex numbers are a 2-dimensional algebra (over
the real numbers) with the multiplication table

          e0     e1
    e0:  +e0    +e1
    e1:  +e1    -e0

Which is, using the symbols '1' and 'i',

          1     i
    1:   +1    +i
    i:   +i    -1

We will denote the components of an n-dimensional algebra by the numbers $0, 1, \ldots, n-1$. The multiplication
table for the complex numbers would thus be written as

          0     1
    0:   +0    +1
    1:   +1    -0


39.14.1 The Cayley-Dickson construction 


The Cayley-Dickson construction recursively defines multiplication tables for certain algebras where the
dimension is a power of 2. Let $a$, $A$, $b$ and $B$ be elements of a $2^{n-1}$-dimensional algebra $U$. Define the
multiplication rule for an algebra $V$ (of dimension $2^n$), written as pairs of elements of $U$, via

    $(a, b) \cdot (A, B) := (a \cdot A - B \cdot b^*,\; a^* \cdot B + A \cdot b)$    (39.14-2)

where the conjugate $C^*$ of an element $C = (a,b)$ is defined as

    $(a, b)^* := (a^*, -b)$    (39.14-3)

and the conjugate of a real number $a$ equals $a$ (unmodified). The construction leads to multiplication
tables where the product of two units always equals $\pm 1$ times some unit: $e_i\,e_j = \pm e_k$.
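
The doubling rule translates directly into a recursive multiplication routine. The following is a minimal sketch under stated assumptions: elements are stored as coefficient vectors of length 2^n with integer entries, and the names cd_vec, cd_conj, cd_mult and cd_unit are invented for this illustration (they are not FXT functions). No attempt is made at efficiency.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using cd_vec = std::vector<long>;  // coefficients of the units e_0, e_1, ...

// Conjugate, relation 39.14-3: (a, b)^* = (a^*, -b), reals are unchanged.
// Unrolled, this negates every coefficient except the one of e_0.
inline cd_vec cd_conj(const cd_vec &x)
{
    cd_vec r(x);
    for (std::size_t k = 1; k < r.size(); ++k)  r[k] = -r[k];
    return r;
}

// Product by relation 39.14-2:
// (a, b) (A, B) = (a A - B b^*, a^* B + A b).
inline cd_vec cd_mult(const cd_vec &x, const cd_vec &y)
{
    const std::size_t n = x.size();
    assert(n == y.size() && n != 0 && (n & (n - 1)) == 0);  // power of 2
    if (n == 1)  return cd_vec{ x[0] * y[0] };  // real numbers
    const std::size_t h = n / 2;
    const cd_vec a(x.begin(), x.begin() + h), b(x.begin() + h, x.end());
    const cd_vec A(y.begin(), y.begin() + h), B(y.begin() + h, y.end());
    const cd_vec lo1 = cd_mult(a, A),          lo2 = cd_mult(B, cd_conj(b));
    const cd_vec hi1 = cd_mult(cd_conj(a), B), hi2 = cd_mult(A, b);
    cd_vec r(n);
    for (std::size_t k = 0; k < h; ++k)
    {
        r[k]     = lo1[k] - lo2[k];  // a A - B b^*
        r[k + h] = hi1[k] + hi2[k];  // a^* B + A b
    }
    return r;
}

// Unit vector e_k of an n-dimensional algebra.
inline cd_vec cd_unit(std::size_t k, std::size_t n)
{
    cd_vec e(n, 0);
    e[k] = 1;
    return e;
}
```

For the quaternions (n = 4) this form of the construction gives e_1 e_2 = -e_3 and e_2 e_1 = +e_3, and every unit other than e_0 squares to -e_0.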


Figure 39.14-A: Multiplication table for the sedenions. The entry in row R, column C gives the product
R·C of the components R and C (hexadecimal notation).

Figure 39.14-B: Signs in the multiplication table for sedenions.


Figure 39.14-A gives the multiplication table for a 16-dimensional algebra, the sedenions. The upper left
8 x 8 square gives the multiplication rule for the octonions (or Cayley numbers), the upper left 4 x 4 square
gives the rule for the quaternions and the upper left 2 x 2 square corresponds to the complex numbers.
Note that multiplication is in general neither commutative (only up to dimension 2) nor associative (only
up to dimension 4).

The $2^n$-dimensional algebras are (for $n > 1$) referred to as hypercomplex numbers. There is no generally
accepted naming scheme for the algebras beyond dimension 16. We will use the names $2^n$-ions.

The form (relation 39.14-2) of the construction is given in [27], an alternative form is used in [135]:

    $(a, b) \cdot (A, B) := (a \cdot A - B^* \cdot b,\; b \cdot A^* + B \cdot a)$    (39.14-4)

It leads to a table that is the transpose of figure 39.14-A.

By construction, $e_0^2 = e_0$, $e_k^2 = -e_0$ for $k \neq 0$, $e_0\,e_k = e_k\,e_0 = e_k$, and $e_k\,e_j = -e_j\,e_k$ whenever both of $k$
and $j$ are nonzero (and $k \neq j$). Further,

    $e_k\,e_j = \pm e_x$  where  $x = k \;\mathrm{XOR}\; j$    (39.14-5)

where the sign is to be determined. Figure 39.14-B shows the pattern of the signs of the sedenion algebra.
The lower left quarter is the transpose of the upper left quarter, so is the lower right quarter, except for
its top row. The upper right quarter is (except for its first row) the negated upper left quarter. These
observations, together with the partial antisymmetry, can be cast into an algorithm to compute the signs
[FXT: aux0/cayley-dickson-mult.h]:


    int CD_sign_rec(ulong r, ulong c, ulong n)
    // Signs in the multiplication table for the
    // algebra of n-ions (where n is a power of 2)
    // that is obtained by the Cayley-Dickson construction:
    // If component r is multiplied with component c, then the
    // result is CD_sign_rec(r,c,n) * (r XOR c).
    {
        if ( (r==0) || (c==0) )  return +1;
        if ( c>=r )
        {
            if ( c>r )  return -CD_sign_rec(c, r, n);
            else        return -1;  // r==c
        }
        // here r>c (triangle below diagonal)
        ulong h = n>>1;
        if ( c>=h )  // right
        {
            // (upper right not reached)
            return CD_sign_rec(c-h, r-h, h);  // lower right
        }
        else  // left
        {
            if ( r>=h )  return CD_sign_rec(c, r-h, h);  // lower left
            else         return CD_sign_rec(r, c, h);  // upper left
        }
    }


The function uses at most $2\,\log_2(n)$ steps. Note that the second row in the table is (the signed version
of) the Thue-Morse sequence, see section 1.16.4. The matrix filled with entries $\pm 1$ according
to figure 39.14-B is a Hadamard matrix, see the chapter starting on page 384. The sequence of signs, read by
anti-diagonals, and setting 0 := + and 1 := -, is an entry in [312].

An iterative version of the function is [FXT: aux0/cayley-dickson-mult.h]:


    inline void cp2(ulong a, ulong b, ulong &u, ulong &v)  { u=a; v=b; }
    //
    inline int CD_sign_it(ulong r, ulong c, ulong n)
    {
        int s = +1;
        while ( true )
        {
            if ( (r==0) || (c==0) )  return s;
            if ( c==r )  return -s;
            if ( c>r )  { swap2(r,c);  s=-s; }
            n >>= 1;
            if ( c>=n )  cp2(c-n, r-n, r, c);
            else if ( r>=n )  cp2(c, r-n, r, c);
        }
    }


The rate of generation with the computation of all $2^{24}$ signs in the multiplication table for the '$2^{12}$-ions'
is about 12 million per second with both routines [FXT: arith/cayley-dickson-demo.cc].


          0    1    2    3    4    5    6    7
    0:   +0   +1   +2   +3   +4   +5   +6   +7      ++++++++
    1:   +1   -0   +6   +4   -3   +7   -2   -5      +-++-+--
    2:   +2   -6   -0   +7   +5   -4   +1   -3      +--++-+-
    3:   +3   -4   -7   -0   +1   +6   -5   +2      +---++-+
    4:   +4   +3   -5   -1   -0   +2   +7   -6      ++---++-
    5:   +5   -7   +4   -6   -2   -0   +3   +1      +-+---++
    6:   +6   +2   -1   +5   -7   -3   -0   +4      ++-+---+
    7:   +7   +5   +3   -2   +6   -1   -4   -0      +++-+---

Figure 39.14-C: Alternative multiplication table for the octonions (left) and its sign pattern (right).

An alternative multiplication table for the octonions is given in figure 39.14-C. Its sign pattern is the
8 x 8 Hadamard matrix shown in figure 19.1-A. Properties of this representation and the
relation to shift register sequences are given in [125].


39.14.2 Fast multiplication of quaternions 


     +--  0  1  2  3       +--   0  1  2  3         +--   0    1    2    3
     |                     |                        |
     0:   0  1  2  3       0:   -0  1  2  3       1 0:    0*   1    2    3
     1:   1  0  3  2       1:    1 -0  3  2       i 1:    1   -0    3   -2*
     2:   2  3  0  1       2:    2  3 -0  1       j 2:    2   -3*  -0    1
     3:   3  2  1  0       3:    3  2  1 -0       k 3:    3    2   -1*  -0

Figure 39.14-D: Scheme for the length-4 dyadic convolution (left), same with bucket zero negated
(middle) and the multiplication table for the units of the quaternions (right). The asterisks mark those
entries where the sign is different from the scheme in the middle.


     -0  1  2  3  4  5  6  7       0  1  2  3  4  5  6  7      #0  1  2  3  4  5  6  7
      1 -0  3  2  5  4  7  6       1 -0  3 -2  5 -4 -7  6       1  0  3 #2  5 #4 #7  6
      2  3 -0  1  6  7  4  5       2 -3 -0  1  6  7 -4 -5       2 #3  0  1  6  7 #4 #5
      3  2  1 -0  7  6  5  4       3  2 -1 -0  7 -6  5 -4       3  2 #1  0  7 #6  5 #4
      4  5  6  7 -0  1  2  3       4 -5 -6 -7 -0  1  2  3       4 #5 #6 #7  0  1  2  3
      5  4  7  6  1 -0  3  2       5  4 -7  6 -1 -0 -3  2       5  4 #7  6 #1  0 #3  2
      6  7  4  5  2  3 -0  1       6  7  4 -5 -2  3 -0 -1       6  7  4 #5 #2  3  0 #1
      7  6  5  4  3  2  1 -0       7 -6  5  4 -3 -2  1 -0       7 #6  5  4 #3 #2  1  0

Figure 39.14-E: Scheme for the length-8 dyadic convolution with bucket zero negated (left) and
multiplication table for the octonions (middle, taken from [135]). There are 22 places where the signs differ
(right, marked with '#'). This leads to an algorithm involving 8 + 22 = 30 multiplications.


Quaternion multiplication can be done with eight real multiplications using the dyadic convolution (see
section 23.8 on page 481). The scheme in figure 39.14-D suggests using the dyadic convolution with bucket
zero negated as a starting point which costs four multiplications. Some entries have to be corrected which
costs four more multiplications.


    // f[] == [ re1, i1, j1, k1 ]
    // g[] == [ re2, i2, j2, k2 ]
    c0 := f[0] * g[0]
    c1 := f[3] * g[2]
    c2 := f[1] * g[3]
    c3 := f[2] * g[1]
    // length-4 dyadic convolution:
    walsh(f[])
    walsh(g[])
    for i:=0 to 3  g[i] := (f[i] * g[i])
    walsh(g[])
    // normalization and correction:
    g[0] :=   2 * c0 - g[0] / 4
    g[1] := - 2 * c1 + g[1] / 4
    g[2] := - 2 * c2 + g[2] / 4
    g[3] := - 2 * c3 + g[3] / 4


The algorithm is taken from [187] which also gives a second variant. 
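
The pseudocode can be checked against the direct product formula. The following is a minimal sketch, not code from FXT: the names quat, walsh4, quat_mult8 and quat_mult_direct are chosen for this illustration, and the components are stored as doubles in the order [re, i, j, k].

```cpp
#include <array>

using quat = std::array<double, 4>;  // [ re, i, j, k ]

// In-place length-4 Walsh transform (unnormalized).
inline void walsh4(quat &f)
{
    const double a = f[0] + f[1], b = f[0] - f[1];
    const double c = f[2] + f[3], d = f[2] - f[3];
    f = { a + c, b + d, a - c, b - d };
}

// Quaternion product with eight real multiplications:
// four for the corrections, four inside the dyadic convolution.
inline quat quat_mult8(quat f, quat g)
{
    const double c0 = f[0] * g[0];
    const double c1 = f[3] * g[2];
    const double c2 = f[1] * g[3];
    const double c3 = f[2] * g[1];
    // length-4 dyadic convolution:
    walsh4(f);
    walsh4(g);
    for (int i = 0; i < 4; ++i)  g[i] *= f[i];
    walsh4(g);
    // normalization and correction:
    return { 2 * c0 - g[0] / 4,
             g[1] / 4 - 2 * c1,
             g[2] / 4 - 2 * c2,
             g[3] / 4 - 2 * c3 };
}

// Direct product with 16 real multiplications, for comparison.
inline quat quat_mult_direct(const quat &f, const quat &g)
{
    return { f[0]*g[0] - f[1]*g[1] - f[2]*g[2] - f[3]*g[3],
             f[0]*g[1] + f[1]*g[0] + f[2]*g[3] - f[3]*g[2],
             f[0]*g[2] - f[1]*g[3] + f[2]*g[0] + f[3]*g[1],
             f[0]*g[3] + f[1]*g[2] - f[2]*g[1] + f[3]*g[0] };
}
```

For example, (1 + 2i + 3j + 4k)(5 + 6i + 7j + 8k) = -60 + 12i + 30j + 24k with both routines. With floating point inputs that are not small integers the two results can differ by rounding errors.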


The complex multiplication by three real multiplications (relation 39.12-2c on page 806) corresponds to
one length-2 Walsh dyadic convolution and the correction for the product of the imaginary units:

    // f[] == [ re1, im1 ]
    // g[] == [ re2, im2 ]
    c0 := f[1] * g[1]  // == im1 * im2
    // length-2 dyadic convolution:
    { f[0], f[1] } := { f[0] + f[1], f[0] - f[1] }
    { g[0], g[1] } := { g[0] + g[1], g[0] - g[1] }
    g[0] := f[0] * g[0]
    g[1] := f[1] * g[1]
    { g[0], g[1] } := { g[0] + g[1], g[0] - g[1] }
    // normalization:
    g[0] := g[0] / 2
    g[1] := g[1] / 2
    // correction:
    g[0] := -2 * c0 + g[0]
    // here: g[] == [ re1 * re2 - im1 * im2,  re1 * im2 + im1 * re2 ]

For complex numbers of high precision multiplication is asymptotically equivalent to two real
multiplications as one FFT-based (complex linear) convolution can be used for the computation. Similarly, high
precision quaternion multiplication is as expensive as four real multiplications. Figure 39.14-E shows an
equivalent construction for the octonions leading to an algorithm with 30 multiplications.


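39.14.3 Eight-square identity ‡

    {C=[
     +B0, +B1, +B2, +B3, +B4, +B5, +B6, +B7;
     +B1, -B0, -B3, +B2, -B5, +B4, +B7, -B6;
     +B2, +B3, -B0, -B1, -B6, -B7, +B4, +B5;
     +B3, -B2, +B1, -B0, -B7, +B6, -B5, +B4;
     +B4, +B5, +B6, +B7, -B0, -B1, -B2, -B3;
     +B5, -B4, +B7, -B6, +B1, -B0, +B3, -B2;
     +B6, -B7, -B4, +B5, +B2, -B3, -B0, +B1;
     +B7, +B6, -B5, -B4, +B3, +B2, -B1, -B0
    ]; }
    A = [ +A0, +A1, +A2, +A3, +A4, +A5, +A6, +A7 ]
    B = [ +B0, +B1, +B2, +B3, +B4, +B5, +B6, +B7 ]

Figure 39.14-F: Symbolic matrix and vectors used with four and eight squares theorems.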


Define the matrix C and vectors A and B as shown in figure 39.14-F (compare to figure 39.14-A). With
P := C A we have the following eight-square identity

    $\sum_{k=0}^{n-1} P_k^2 = \sum_{k=0}^{n-1} A_k^2 \,\cdot\, \sum_{k=0}^{n-1} B_k^2$    (39.14-6)

The components of P are

    P0 = + A7*B7 + A6*B6 + A5*B5 + A4*B4 + A3*B3 + A2*B2 + A1*B1 + A0*B0
    P1 = + A6*B7 - A7*B6 - A4*B5 + A5*B4 - A2*B3 + A3*B2 + A0*B1 - A1*B0
    P2 = - A5*B7 - A4*B6 + A7*B5 + A6*B4 + A1*B3 + A0*B2 - A3*B1 - A2*B0
    P3 = - A4*B7 + A5*B6 - A6*B5 + A7*B4 + A0*B3 - A1*B2 + A2*B1 - A3*B0
    P4 = + A3*B7 + A2*B6 + A1*B5 + A0*B4 - A7*B3 - A6*B2 - A5*B1 - A4*B0
    P5 = + A2*B7 - A3*B6 + A0*B5 - A1*B4 + A6*B3 - A7*B2 + A4*B1 - A5*B0
    P6 = - A1*B7 + A0*B6 + A3*B5 - A2*B4 - A5*B3 + A4*B2 + A7*B1 - A6*B0
    P7 = + A0*B7 + A1*B6 - A2*B5 - A3*B4 + A4*B3 + A5*B2 - A6*B1 - A7*B0

The given equality also holds if matrix and vectors are truncated to length 4 (four-square identity), 2, or
1 (trivially):

    ? n=4; An=vector(n,k,A[k]); Bn=vector(n,k,B[k]); Cn=matrix(n,n,r,c,C[r,c]);
    ? Pn=Cn*An~;
    ? t1=sum(k=1,n,Pn[k]^2);
    ? t2=sum(k=1,n,An[k]^2) * sum(k=1,n,Bn[k]^2);
    ? t1-t2
    0

With length 16 the difference of the left and right side of relation 39.14-6 has 168 terms:
4*( 
+A01*A10*BO4*B15 -AO1*A10*BO6*B12 -AO1*A10*BOG*B13 +A01*A10*BO7*B13 
+A01*A12*BO3*B14 +A01*A12*BO7*B10 +A01*A13*BO2*B14 +A01*A13*BO3*B15 
+A01*A14*BO4*B11 +A01*A14*BO5*B10 +A01*A15*BO2*B12 +A01*A15*BO5*B11 
+A02*A09*BO5*B14 
sh. isis 
-AO1*A10*BO4*B14 -AQ1*A10*BO5*B14 -AO1*A10*BOb*B15 -A01*A10*BOT7*B12 
-A01*A12*BO2*B15 -A01*A12*BO6*B11 -AQ1*A13*BO6*B10 -A01*A13*BOT7*B11 
-AO1*A14*BO2*B13 -AO1*A14*BO3*B12 -AO1*A15*BO3*B13 -A01*A15*BO4*B10 
-A02*A09*B06*B13 




39.14.4 Simple zero-divisors of the sedenions ‡


     1: ( 1 + 10 ) * ( 4 - 15 )    29: ( 2 + 11 ) * ( 4 - 13 )    57: ( 3 + 14 ) * ( 4 +  9 )
     2: ( 1 + 10 ) * ( 5 + 14 )    30: ( 2 + 11 ) * ( 5 + 12 )    58: ( 3 + 14 ) * ( 7 - 10 )
     3: ( 1 + 10 ) * ( 6 - 13 )    31: ( 2 + 11 ) * ( 6 + 15 )    59: ( 3 + 15 ) * ( 5 +  9 )
     4: ( 1 + 10 ) * ( 7 + 12 )    32: ( 2 + 11 ) * ( 7 - 14 )    60: ( 3 + 15 ) * ( 6 + 10 )
     5: ( 1 + 11 ) * ( 4 + 14 )    33: ( 2 + 12 ) * ( 3 + 13 )    61: ( 4 +  9 ) * ( 6 - 11 )
     6: ( 1 + 11 ) * ( 5 + 15 )    34: ( 2 + 12 ) * ( 5 - 11 )    62: ( 4 +  9 ) * ( 7 + 10 )
     7: ( 1 + 11 ) * ( 6 - 12 )    35: ( 2 + 12 ) * ( 7 +  9 )    63: ( 4 + 10 ) * ( 5 + 11 )
     8: ( 1 + 11 ) * ( 7 - 13 )    36: ( 2 + 13 ) * ( 3 - 12 )    64: ( 4 + 10 ) * ( 7 -  9 )
     9: ( 1 + 12 ) * ( 2 + 15 )    37: ( 2 + 13 ) * ( 4 + 11 )    65: ( 4 + 11 ) * ( 5 - 10 )
    10: ( 1 + 12 ) * ( 3 - 14 )    38: ( 2 + 13 ) * ( 6 -  9 )    66: ( 4 + 11 ) * ( 6 +  9 )
    11: ( 1 + 12 ) * ( 6 + 11 )    39: ( 2 + 14 ) * ( 3 - 15 )    67: ( 4 + 13 ) * ( 6 + 15 )
    12: ( 1 + 12 ) * ( 7 - 10 )    40: ( 2 + 14 ) * ( 5 +  9 )    68: ( 4 + 13 ) * ( 7 - 14 )
    13: ( 1 + 13 ) * ( 2 - 14 )    41: ( 2 + 14 ) * ( 7 + 11 )    69: ( 4 + 14 ) * ( 5 - 15 )
    14: ( 1 + 13 ) * ( 3 - 15 )    42: ( 2 + 15 ) * ( 3 + 14 )    70: ( 4 + 14 ) * ( 7 + 13 )
    15: ( 1 + 13 ) * ( 6 + 10 )    43: ( 2 + 15 ) * ( 4 -  9 )    71: ( 4 + 15 ) * ( 5 + 14 )
    16: ( 1 + 13 ) * ( 7 + 11 )    44: ( 2 + 15 ) * ( 6 - 11 )    72: ( 4 + 15 ) * ( 6 - 13 )
    17: ( 1 + 14 ) * ( 2 + 13 )    45: ( 3 +  9 ) * ( 4 - 14 )    73: ( 5 +  9 ) * ( 6 - 10 )
    18: ( 1 + 14 ) * ( 3 + 12 )    46: ( 3 +  9 ) * ( 5 - 15 )    74: ( 5 +  9 ) * ( 7 - 11 )
    19: ( 1 + 14 ) * ( 4 - 11 )    47: ( 3 +  9 ) * ( 6 + 12 )    75: ( 5 + 10 ) * ( 6 +  9 )
    20: ( 1 + 14 ) * ( 5 - 10 )    48: ( 3 +  9 ) * ( 7 + 13 )    76: ( 5 + 11 ) * ( 7 +  9 )
    21: ( 1 + 15 ) * ( 2 - 12 )    49: ( 3 + 10 ) * ( 4 + 13 )    77: ( 5 + 12 ) * ( 6 - 15 )
    22: ( 1 + 15 ) * ( 3 + 13 )    50: ( 3 + 10 ) * ( 5 - 12 )    78: ( 5 + 12 ) * ( 7 + 14 )
    23: ( 1 + 15 ) * ( 4 + 10 )    51: ( 3 + 10 ) * ( 6 - 15 )    79: ( 5 + 14 ) * ( 7 - 12 )
    24: ( 1 + 15 ) * ( 5 - 11 )    52: ( 3 + 10 ) * ( 7 + 14 )    80: ( 5 + 15 ) * ( 6 + 12 )
    25: ( 2 +  9 ) * ( 4 + 15 )    53: ( 3 + 12 ) * ( 5 + 10 )    81: ( 6 + 10 ) * ( 7 - 11 )
    26: ( 2 +  9 ) * ( 5 - 14 )    54: ( 3 + 12 ) * ( 6 -  9 )    82: ( 6 + 11 ) * ( 7 + 10 )
    27: ( 2 +  9 ) * ( 6 + 13 )    55: ( 3 + 13 ) * ( 4 - 10 )    83: ( 6 + 12 ) * ( 7 - 13 )
    28: ( 2 +  9 ) * ( 7 - 12 )    56: ( 3 + 13 ) * ( 7 -  9 )    84: ( 6 + 13 ) * ( 7 + 12 )


Figure 39.14-G: Products of simple zero-divisors of the sedenions. 


An element $U \neq 0$ such that there is an element $V \neq 0$ and either $U\,V = 0$ or $V\,U = 0$ is called a
zero-divisor. The simplest zero-divisors of the sedenions are sums or differences of two units, we call
these simple zero-divisors. Figure 39.14-G gives products of simple zero-divisors where the first factor is
a sum of units in symbolic form. For every entry $(a + b)\,(c \pm d)$ there is another product $(a - b)\,(c \mp d)$.
All products remain zero when the factors are swapped. The list was created with the program [FXT:
arith/zero-divisors-demo.cc], see also [FXT: data/sedenion-zero-products.txt].

Let $(a,b)$ be a pair of indices such that neither $a$ nor $b$ have all three lowest bits zero, the highest bits of
$a$ and $b$ are different, and $a$ and $b$ do not coincide in all three lowest bits. Then all elements $U = (\pm a \pm b)$
are zero divisors and both equations $U\,V = 0$ and $W\,U = 0$ have a solution where $V$ and $W$ are elements
of the same type.

If $(\pm a \pm b)\,(\pm c \pm d) = 0$ then $(a \;\mathrm{XOR}\; b) = (c \;\mathrm{XOR}\; d)$. The converse is not true.

There are 42 zero-divisors that are sums of two units, appearing as either left or right factor in figure
39.14-G (they are listed in [FXT: data/sedenion-zero-divisors.txt]). If $(+a + b)$ is a zero-divisor then all
4 of $(\pm a \pm b)$ are zero-divisors, so there are 168 simple zero-divisors.
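
A product of two simple zero-divisors can be checked with the sign routine for the Cayley-Dickson tables. The following is a minimal sketch under stated assumptions: cd_sign repeats the logic of the iterative sign function given earlier in this section (with std::swap in place of the FXT helper), while cd_product and cd_is_zero are names invented for this illustration; elements are coefficient vectors of length 16.

```cpp
#include <utility>
#include <vector>

typedef unsigned long ulong;

// Sign of e_r * e_c in the n-ions (n a power of 2),
// same logic as the iterative sign routine of this section.
inline int cd_sign(ulong r, ulong c, ulong n)
{
    int s = +1;
    while (true)
    {
        if ((r == 0) || (c == 0))  return s;
        if (c == r)  return -s;
        if (c > r)  { std::swap(r, c);  s = -s; }
        n >>= 1;
        if (c >= n)       { const ulong t = c - n;  c = r - n;  r = t; }
        else if (r >= n)  { const ulong t = c;      c = r - n;  r = t; }
    }
}

// Product of two elements via e_k e_j = sign * e_(k XOR j).
inline std::vector<long> cd_product(const std::vector<long> &x,
                                    const std::vector<long> &y)
{
    const ulong n = x.size();
    std::vector<long> p(n, 0);
    for (ulong k = 0; k < n; ++k)
        for (ulong j = 0; j < n; ++j)
            p[k ^ j] += cd_sign(k, j, n) * x[k] * y[j];
    return p;
}

// True if all coefficients vanish.
inline bool cd_is_zero(const std::vector<long> &x)
{
    for (long v : x)  if (v != 0)  return false;
    return true;
}
```

With u = e_1 + e_10 and v = e_4 - e_15 (first entry of figure 39.14-G) both products u v and v u vanish, while u (e_4 + e_15) does not.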


There are zero-divisors for all Cayley-Dickson algebras with at least $2^n = 16$ elements (and none for less
than 16 elements). The sequence of numbers of zero-divisors of the form $(a+b)$ is entry A167654 in [312]:

    n:   0, 1, 2, 3,  4,   5,    6,    7,     8, ...
    zd:  0, 0, 0, 0, 42, 294, 1518, 6942, 29886, ...

The sequence can be computed as follows:

    ? v=vector(14); v[4]=42; for(k=5,#v, v[k]=2*v[k-1]+(2^(k-1)-1)*(2^(k-1)-2) ); v
    [0, 0, 0, 42, 294, 1518, 6942, 29886, 124542, 509694, 2064894, 8317950, 33400830, 133885950]


The pairs of units leading to zero-divisors of the 64-ions are shown in figure 39.14-H. The upper left 8 x 8
entries are zero as there are no zero-divisors for the octonions. The simple zero-divisors of the sedenions
appear in the upper left 16 x 16 matrix. The matrix is from [FXT: data/zero-divisor-structure.txt].

Figure 39.14-H: Matrix indicating the pairs (a,b) of units such that all 4 of $(\pm a \pm b)$ are zero-divisors.

Chapter 40


Binary polynomials 


We introduce binary polynomials and their arithmetic. Tests for irreducibility, primitivity, and a method
for factorization are given. Many of the algorithms shown can easily be implemented in hardware. An
important application is the linear feedback shift registers, described in a separate chapter. The arithmetic
operations with binary polynomials are the underlying methods for computations in binary finite fields
which are treated in chapter 42.

A polynomial with coefficients in the field GF(2) = Z/2Z (that is, 'coefficients modulo 2') is called a
binary polynomial. The operations proceed as for usual polynomials except that the coefficients have to
be reduced modulo 2. To represent a binary polynomial in a binary computer we use words where the
bits are set at the positions where the polynomial coefficients are one. We use the convention that the
coefficient of $x^k$ appears at bit $k$, so the constant term lies at the least significant bit.


40.1 The basic arithmetical operations 


Addition of binary polynomials is the XOR operation. Subtraction is the very same operation. 


Multiplication of a binary polynomial by its independent variable x is simply a shift to the left. 


40.1.1 Multiplication and squaring 


Multiplication of two polynomials A and B is identical to the usual (binary algorithm for) multiplication,
except that no carry occurs [FXT: bpol/bitpol-arith.h]:

    inline ulong bitpol_mult(ulong a, ulong b)
    // Return  A * B
    {
        ulong t = 0;
        while ( b )
        {
            if ( b & 1 )  t ^= a;
            b >>= 1;
            a <<= 1;
        }
        return  t;
    }

As for integer multiplication with the C-type unsigned long, the result will silently overflow if
deg(A) + deg(B) is equal to or greater than the word length (BITS_PER_LONG). If the operation t^=a;
was replaced with t+=a; the ordinary (integer) product would be returned [FXT: gf2n/bitpolmult-...],
see figure 40.1-A (top). When a binary polynomial $p = \sum_k a_k\,x^k$ is squared, the result
equals $p^2 = \sum_k a_k\,x^{2k}$, figure 40.1-A (bottom). So we just have to move the bits from position k to
position 2k:


    inline ulong bitpol_square(ulong a)
    // Return  A * A
    {
        ulong t = 0,  m = 1UL;
        while ( a )
        {
            if ( a&1 )  t ^= m;
            m <<= 2;
            a >>= 1;
        }
        return  t;  // == bitpol_mult(a, a);
    }

Figure 40.1-A: Multiplication (top) and squaring (bottom) of binary polynomials and numbers. The top
panel shows the product 1..11.111 * 11.1.1, the bottom panel the square 1..11.111 * 1..11.111, each computed
both as product of binary polynomials and as ordinary product.

40.1.2 Optimization of the squaring and multiplication routines



The routines for multiplication and squaring can be optimized by partially unrolling which avoids
branches. As given, the function is compiled to:

    xor    %ecx,%ecx                       // t = 0
    test   %rdi,%rdi                       // a
    mov    $0x1,%edx                       // m = 1
    je     27 <_Z13bitpol_squarem+0x27>    // a==0 ?
    mov    %rcx,%rax                       // tmp = t
    xor    %rdx,%rax                       // tmp ^= m
    test   $0x1,%dil                       // if ( a&1 )
    cmovne %rax,%rcx                       // then t = tmp
    shl    $0x2,%rdx                       // m <<= 2
    shr    %rdi                            // a >>= 1
    jne    10 <_Z13bitpol_squarem+0x10>    // a!=0 ?
    mov    %rcx,%rax
    retq


The if-statement does not cause a branch so we unroll the contents of the loop 4-fold. We also move the
while() statement to the end of the loop to avoid the initial branch:

    inline ulong bitpol_square(ulong a)
    {
        ulong t = 0,  m = 1UL;
        do
        {
            if ( a&1 )  t ^= m;
            m <<= 2;  a >>= 1;
            if ( a&1 )  t ^= m;
            m <<= 2;  a >>= 1;
            if ( a&1 )  t ^= m;
            m <<= 2;  a >>= 1;
            if ( a&1 )  t ^= m;
            m <<= 2;  a >>= 1;
        }
        while ( a );
        return  t;
    }


Now we obtain machine code that executes much faster:

     0:   31 c9             xor    %ecx,%ecx     // t = 0
     2:   ba 01 00 00 00    mov    $0x1,%edx     // m = 1
     7:   48 89 c8          mov    %rcx,%rax     // tmp = t
     a:   48 31 d0          xor    %rdx,%rax     // tmp ^= m
     d:   40 f6 c7 01       test   $0x1,%dil     // if ( a&1 )
    11:   48 0f 45 c8       cmovne %rax,%rcx     // then t = tmp
    15:   48 c1 e2 02       shl    $0x2,%rdx     // m <<= 2
    19:   48 d1 ef          shr    %rdi          // a >>= 1
    1c:   48 89 c8          mov    %rcx,%rax
    1f:   48 31 d0          xor    %rdx,%rax
    22:   40 f6 c7 01       test   $0x1,%dil
    26:   48 0f 45 c8       cmovne %rax,%rcx
    2a:   48 c1 e2 02       shl    $0x2,%rdx
    2e:   48 d1 ef          shr    %rdi
    31:   48 89 c8          mov    %rcx,%rax
    [--snip--]
    43:   48 d1 ef          shr    %rdi
    46:   48 89 c8          mov    %rcx,%rax
    [--snip--]
    58:   48 d1 ef          shr    %rdi
    5b:   75 aa             jne    7 <_Z13bitpol_squarem+0x7>    // a!=0 ?
    5d:   48 89 c8          mov    %rcx,%rax
    60:   c3                retq


The multiplication algorithm is optimized in the same way. For squaring we can also use the bit-zip
function:

    inline ulong bitpol_square(ulong a)  { return bit_zip0( a ); }

The upper half of the bits of the argument must be zero.
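
A small worked example may help to see the difference between the product of binary polynomials and the integer product. The following sketch repeats the simple (not unrolled) loops for multiplication and squaring so that it is self-contained; int_mult is a name invented here for the variant with t+=a.

```cpp
typedef unsigned long ulong;

// Product of binary polynomials: shift and XOR, no carries.
inline ulong bitpol_mult(ulong a, ulong b)
{
    ulong t = 0;
    while (b)
    {
        if (b & 1)  t ^= a;  // add (XOR) the shifted copy of A
        b >>= 1;
        a <<= 1;
    }
    return t;
}

// Square of a binary polynomial: move bit k to bit 2k.
inline ulong bitpol_square(ulong a)
{
    ulong t = 0,  m = 1UL;
    while (a)
    {
        if (a & 1)  t ^= m;
        m <<= 2;
        a >>= 1;
    }
    return t;
}

// Same control flow with '+' instead of XOR: the ordinary integer product.
inline ulong int_mult(ulong a, ulong b)
{
    ulong t = 0;
    while (b)
    {
        if (b & 1)  t += a;
        b >>= 1;
        a <<= 1;
    }
    return t;
}
```

For the polynomial x+1 (binary 11) we get (x+1)^2 = x^2+1, that is bitpol_mult(3,3) == 5, whereas int_mult(3,3) == 9. Squaring x^2+x+1 (binary 111) gives x^4+x^2+1 (binary 10101).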


40.1.3 Exponentiation

With a multiplication (and squaring) function at hand, it is straightforward (see section 28.5 on page 563)
to implement the algorithm for binary exponentiation [FXT: bpol/bitpol-arith.h]:

    inline ulong bitpol_power(ulong a, ulong e)
    // Return  A ** e
    {
        if ( 0==e )  return 1;

        ulong s = a;
        while ( 0==(e&1) )
        {
            s = bitpol_square(s);
            e >>= 1;
        }

        a = s;
        while ( 0!=(e>>=1) )
        {
            s = bitpol_square(s);
            if ( e & 1 )  a = bitpol_mult(a, s);
        }
        return  a;
    }

Note that overflow will occur even for moderate exponents.


40.1.4 Quotient and remainder 


The remainder a modulo b can be computed by initializing A = a and subtracting $B = x^j \cdot b$ with
deg(B) = deg(A) from A at each step. The computation is finished as soon as deg b > deg A. As C-code
[FXT: bpol/bitpol-arith.h]:

    inline ulong bitpol_rem(ulong a, ulong b)
    // Return  R = A % B = A - (A/B)*B
    // Must have: B!=0
    {
        const ulong db = highest_one_idx(b);
        ulong da;
        while ( db <= (da=highest_one_idx(a)) )
        {
            if ( 0==a )  break;  // needed because highest_one_idx(0)==highest_one_idx(1)
            a ^= (b<<(da-db));
        }
        return  a;
    }


The function highest_one_idx() is given in section 1.6 on page 14. The following version may be
superior if the degree of a is small or if no fast version of the function highest_one_idx() is available:

    while ( (b<=a) || ((a^b)<a) )  // deg(b)<=deg(a), second test needed if deg(a)==deg(b) and a<b
    {
        ulong t = b;
        while ( (a^t) > t )  t <<= 1;
        // =^= while ( highest_one(a) > highest_one(t) )  t <<= 1;
        a ^= t;
    }
    return  a;


The quotient and remainder of two polynomials is computed as follows:

    inline void bitpol_divrem(ulong a, ulong b, ulong &q, ulong &r)
    // Set R, Q so that A == Q * B + R.
    // Must have B!=0.
    {
        const ulong db = highest_one_idx(b);
        q = 0;  // quotient
        ulong da;
        while ( db <= (da=highest_one_idx(a)) )
        {
            if ( 0==a )  break;  // needed because highest_one_idx(0)==highest_one_idx(1)
            a ^= (b<<(da-db));
            q ^= (1UL<<(da-db));
        }
        r = a;
    }

The division routine does the same computation but discards the remainder:

    inline ulong bitpol_div(ulong a, ulong b)
    // Return  Q = A / B
    // Must have B!=0.
    {
        [--snip--]  // identical code
        return  q;
    }
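
The defining relation A = Q B + R can be used as a test for the division routine. The following is a minimal sketch: highest_one_idx is replaced by a simple loop (a stand-in for the fast FXT function, returning 0 for the arguments 0 and 1), and divrem_ok is a helper invented for this illustration.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Stand-in for highest_one_idx(): index of the highest set bit, 0 if x==0.
inline ulong highest_one_idx(ulong x)
{
    ulong i = 0;
    while (x >>= 1)  ++i;
    return i;
}

// Product of binary polynomials (shift and XOR).
inline ulong bitpol_mult(ulong a, ulong b)
{
    ulong t = 0;
    while (b)  { if (b & 1)  t ^= a;   b >>= 1;  a <<= 1; }
    return t;
}

// Set Q and R so that A == Q * B + R, must have B != 0.
inline void bitpol_divrem(ulong a, ulong b, ulong &q, ulong &r)
{
    assert(b != 0);
    const ulong db = highest_one_idx(b);
    q = 0;
    ulong da;
    while (db <= (da = highest_one_idx(a)))
    {
        if (0 == a)  break;
        a ^= (b << (da - db));      // subtract x^(da-db) * B
        q ^= (1UL << (da - db));    // record the quotient term
    }
    r = a;
}

// Check A == Q*B + R and that R is zero or deg(R) < deg(B).
inline bool divrem_ok(ulong a, ulong b)
{
    ulong q, r;
    bitpol_divrem(a, b, q, r);
    const bool small = (r == 0) || (highest_one_idx(r) < highest_one_idx(b));
    return small && ((bitpol_mult(q, b) ^ r) == a);
}
```

For example, dividing x^4 by x^2+x+1 gives the quotient x^2+x and the remainder x, because (x^2+x)(x^2+x+1) = x^4+x over GF(2).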


40.1.5 Greatest common divisor (GCD) 


The polynomial greatest common divisor (GCD) can be computed with the Euclidean algorithm [FXT:
bpol/bitpol-gcd.h]:

    inline ulong bitpol_gcd(ulong a, ulong b)
    // Return polynomial gcd(A, B)
    {
        if ( a<b )  { ulong t=a; a=b; b=t; }  // swap if deg(A)<deg(B)
        // here: b<=a
        while ( 0!=b )
        {
            ulong c = bitpol_rem(a, b);
            a = b;
            b = c;
        }
        return  a;
    }

Note that the comment

    if ( a<b )  { ulong t=a; a=b; b=t; }  // swap if deg(A)<deg(B)

is not strictly correct as the swap can also happen with deg(a) = deg(b) but that does no harm.


The binary GCD algorithm can be implemented as follows:

    inline ulong bitpol_binary_gcd(ulong a, ulong b)
    {
        if ( a<b )  { ulong t=a; a=b; b=t; }  // swap if deg(A)<deg(B)
        if ( b==0 )  return a;

        ulong k = 0;
        while ( !((a|b)&1) )  // both divisible by x
        {
            k++;
            a >>= 1;
            b >>= 1;
        }

        while ( !(a&1) )  a >>= 1;
        while ( !(b&1) )  b >>= 1;

        while ( a!=b )
        {
            if ( a<b )  { ulong t=a; a=b; b=t; }  // swap if deg(A)<deg(B)
            ulong t = (a^b) >> 1;
            while ( !(t&1) )  t >>= 1;
            a = t;
        }

        return  a << k;
    }


With a fast bit-scan instruction we can optimize the function:

    inline ulong bitpol_binary_gcd(ulong a, ulong b)
    {
        if ( (a==0) || (b==0) )  return a|b;  // one (or both) of a, b zero?

        ulong ka = lowest_one_idx(a);
        a >>= ka;
        ulong kb = lowest_one_idx(b);
        b >>= kb;
        ulong k = ( ka<kb ? ka : kb );

        while ( a!=b )
        {
            if ( a<b )  { ulong t=a; a=b; b=t; }  // swap if deg(A)<deg(B)
            ulong t = (a^b) >> 1;
            a = (t >> lowest_one_idx(t));
        }

        return  a << k;
    }
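
The two GCD algorithms can be compared against each other. The following is a minimal sketch: lowest_one_idx and highest_one_idx are simple loop-based stand-ins for the fast FXT functions, and gcds_agree is a helper invented for this illustration.

```cpp
typedef unsigned long ulong;

// Stand-ins for the bit-scan functions (simple loops).
inline ulong lowest_one_idx(ulong x)   // x must be nonzero
{
    ulong i = 0;
    while (!(x & 1))  { x >>= 1;  ++i; }
    return i;
}

inline ulong highest_one_idx(ulong x)  // returns 0 for x==0
{
    ulong i = 0;
    while (x >>= 1)  ++i;
    return i;
}

// Remainder A mod B, must have B != 0.
inline ulong bitpol_rem(ulong a, ulong b)
{
    const ulong db = highest_one_idx(b);
    ulong da;
    while (db <= (da = highest_one_idx(a)))
    {
        if (0 == a)  break;
        a ^= (b << (da - db));
    }
    return a;
}

// Euclidean algorithm.
inline ulong bitpol_gcd(ulong a, ulong b)
{
    while (0 != b)  { const ulong c = bitpol_rem(a, b);  a = b;  b = c; }
    return a;
}

// Binary GCD using the bit-scan stand-in.
inline ulong bitpol_binary_gcd(ulong a, ulong b)
{
    if ((a == 0) || (b == 0))  return a | b;
    const ulong ka = lowest_one_idx(a);  a >>= ka;
    const ulong kb = lowest_one_idx(b);  b >>= kb;
    const ulong k = (ka < kb ? ka : kb);   // common power of x
    while (a != b)
    {
        if (a < b)  { const ulong t = a;  a = b;  b = t; }
        const ulong t = (a ^ b) >> 1;      // a+b is divisible by x
        a = (t >> lowest_one_idx(t));
    }
    return a << k;
}

// True if both routines agree for all pairs below the given limit.
inline bool gcds_agree(ulong limit)
{
    for (ulong a = 0; a < limit; ++a)
        for (ulong b = 0; b < limit; ++b)
            if (bitpol_gcd(a, b) != bitpol_binary_gcd(a, b))  return false;
    return true;
}
```

For example, with A = x^3+1 = (x+1)(x^2+x+1) and B = x^2+x = x (x+1) both routines return x+1.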


40.1.6 Exact division 


Let C be a binary polynomial in x with constant term 1. We use the relation (for power series)

    $\frac{A}{C} = \frac{A}{1-Y} = A\,(1+Y)\,(1+Y^2)\,(1+Y^4)\,(1+Y^8) \cdots (1+Y^{2^n}) \cdots$    (40.1-1)

where $Y = 1 - C$. Now let $Y = x^{e_1} + x^{e_2} + \ldots + x^{e_k}$ where $e_1 \geq 1$ and $e_{i+1} > e_i$. Then $Y^q =
x^{q\,e_1} + x^{q\,e_2} + \ldots + x^{q\,e_k}$ whenever $q$ is a power of 2, and the multiplication by $(1 + Y^q)$ is done by shifts
and subtractions. If A is an exact multiple of C, then R = A/C is a polynomial that can be computed
as follows. We assume that arrays of N bits are used for the polynomials.

1. Set R := A and let $e_i$ (for $i = 1, 2, \ldots, k$) be the (ordered) positions of the nonzero coefficients of
$Y$ (those of C apart from the constant term). Set q := 1.

2. If $q\,e_1 \geq N$, then return R.

3. Set T := R. For $j = 1, 2, \ldots, k$, set $T := T + R\,x^{q\,e_j}$. The multiplications with $x^{q\,e_j}$ are left shifts by
$q\,e_j$ positions. Set R := T.

4. Set q := 2q and goto step 2.
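
The steps above can be sketched for polynomials that fit into a single word. This is an illustration under stated assumptions, not an FXT routine: the name bitpol_div_exact is made up, N is taken to be the number of bits of unsigned long, and T starts from R, which supplies the term 1 of the factor (1 + Y^q).

```cpp
#include <cassert>
#include <limits>

typedef unsigned long ulong;

// Return A / C as power series truncated to one word, C must have constant term 1.
// If A is a multiple of C, this is the exact quotient.
inline ulong bitpol_div_exact(ulong a, ulong c)
{
    assert(c & 1UL);
    const unsigned nbits = std::numeric_limits<ulong>::digits;  // N
    const ulong y = c ^ 1UL;         // Y = C - 1 (same as 1 - C over GF(2))
    if (y == 0)  return a;           // C == 1
    unsigned e1 = 0;                 // lowest exponent in Y
    while (!((y >> e1) & 1UL))  ++e1;

    ulong r = a;
    for (unsigned q = 1; q * e1 < nbits; q *= 2)   // step 2 and step 4
    {
        ulong t = r;                               // the term R * 1
        for (unsigned e = e1; e < nbits; ++e)      // step 3: all exponents of Y
        {
            if (!((y >> e) & 1UL))  continue;
            const unsigned sh = q * e;
            if (sh < nbits)  t ^= (r << sh);       // T += R * x^(q*e)
        }
        r = t;
    }
    return r;
}
```

For example, (x^3+1)/(x+1) = x^2+x+1 and (x^5+x^4+1)/(x^2+x+1) = x^3+x+1. Intermediate values of R are truncated power series, only the final value is the quotient.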


The method is most efficient if $k$, the number of nonzero coefficients of $C - 1$, is small. Sometimes we can
reduce the work by dividing by $C\,D$ and finally multiplying by $D$ for some appropriate $D$. For example,
with all-ones polynomials $C = 1 + x + x^2 + \ldots + x^k$ and $D = 1 + x$, then $C\,D = 1 + x^{k+1}$. If $C$ is of the
form $x^u\,(1 + \ldots + x^k)$, then $A/C$ can be computed as $(A/x^u)/(C/x^u)$.

The simplest example is $C = 1 + x$ where the above procedure reduces to the inverse reversed Gray code
given in section 1.16.6 on page 45:


    inline ulong bitpol_div_xp1(ulong a)
    // Return power series A / (x+1)
    // If A is a multiple of x+1, then the returned value
    // is the exact division by x+1
    {
        a ^= a<<1;  // rev_gray ** 1
        a ^= a<<2;  // rev_gray ** 2
        a ^= a<<4;  // rev_gray ** 4
        a ^= a<<8;  // rev_gray ** 8
        a ^= a<<16;  // rev_gray ** 16
    #if  BITS_PER_LONG >= 64
        a ^= a<<32;  // for 64bit words
    #endif
        return a;
    }

For the division by $x^2 + 1$ use the function

    inline ulong bitpol_div_x2p1(ulong a)
    // Return power series A / (x^2+1)
    // If A is a multiple of x^2+1, then the returned value
    // is the exact division by x^2+1
    {
        a ^= a<<2;  // rev_gray ** 2
        a ^= a<<4;  // rev_gray ** 4
        a ^= a<<8;  // rev_gray ** 8
        a ^= a<<16;  // rev_gray ** 16
    #if  BITS_PER_LONG >= 64
        a ^= a<<32;  // for 64bit words
    #endif
        return a;
    }

An algorithm for the exact division by $C = 2^k \pm 1$ (over Z) is given in section 1.21.2 on page 57.

40.2 Multiplying binary polynomials of high degree 


We used the straightforward multiplication scheme which is $O(N^2)$ for polynomials of degree N. This is
fine when working with polynomials of small degree. However, when working with polynomials of high
degree, the following splitting schemes should be used.

40.2.1 Karatsuba method: 2-way splitting

Let U and V be binary polynomials of (even) degree N. Write $U = U_0 + U_1\,x^{N/2}$, $V = V_0 + V_1\,x^{N/2}$ and
use the scheme

    $U\,V = U_0 \cdot V_0\,(1 + x^{N/2}) + (U_1 - U_0) \cdot (V_0 - V_1)\,x^{N/2} + U_1 \cdot V_1\,(x^{N/2} + x^N)$    (40.2-1)

Only the three multiplications indicated by a dot are expensive, the multiplications by a power of x
are just shifts. The resulting scheme is the Karatsuba multiplication for polynomials, relation 28.1-3 on
page 550 interpreted for polynomials (set $x^{N/2} = B$). Recursive application of the scheme leads to the
asymptotic cost $O(N^{\log_2(3)}) \approx O(N^{1.585})$.
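
One step of the splitting can be sketched for word-sized operands. The following is an illustration under stated assumptions: the operands have at most 32 coefficients and are split into halves of 16 coefficients each, the names bitpol_mult_basic and bitpol_mult_karatsuba are made up, and over GF(2) the subtractions in relation 40.2-1 become XORs.

```cpp
#include <cstdint>

// Schoolbook product of binary polynomials (shift and XOR).
inline std::uint64_t bitpol_mult_basic(std::uint64_t a, std::uint64_t b)
{
    std::uint64_t t = 0;
    while (b)  { if (b & 1)  t ^= a;   b >>= 1;  a <<= 1; }
    return t;
}

// One Karatsuba step: three half-size products instead of four.
inline std::uint64_t bitpol_mult_karatsuba(std::uint32_t u, std::uint32_t v)
{
    const unsigned h = 16;  // number of coefficients in each half
    const std::uint64_t u0 = u & 0xffffu,  u1 = u >> h;
    const std::uint64_t v0 = v & 0xffffu,  v1 = v >> h;
    const std::uint64_t p0 = bitpol_mult_basic(u0, v0);            // U0 * V0
    const std::uint64_t pm = bitpol_mult_basic(u1 ^ u0, v0 ^ v1);  // (U1-U0) * (V0-V1)
    const std::uint64_t p1 = bitpol_mult_basic(u1, v1);            // U1 * V1
    // U0 V0 (1 + x^h) + (U1-U0)(V0-V1) x^h + U1 V1 (x^h + x^(2h)):
    return  p0 ^ (p0 << h) ^ (pm << h) ^ (p1 << h) ^ (p1 << (2 * h));
}
```

In a real implementation the three half-size products would themselves be computed recursively (or by a fast base case) rather than by the schoolbook loop used here.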


40.2.2 Splitting schemes that do not involve constants ‡


A generalization of the Karatsuba scheme is given in [347] (see also [348]). It does not lead to schemes
asymptotically better than $O(N^{\log_2(3)})$ but has a simple structure and avoids all multiplications by
constants. As with 2-way splitting the method works for polynomials over any field.

For the 3-way splitting scheme set


    $A = a_2\,x^2 + a_1\,x + a_0$    (40.2-2a)
    $B = b_2\,x^2 + b_1\,x + b_0$    (40.2-2b)
    $C = A\,B = c_4\,x^4 + c_3\,x^3 + c_2\,x^2 + c_1\,x + c_0$    (40.2-2c)

Then the $c_k$ are

    $c_0 = d_{0,0}$    (40.2-3a)
    $c_1 = d_{0,1} - d_{0,0} - d_{1,1}$    (40.2-3b)
    $c_2 = d_{0,2} - d_{0,0} - d_{2,2} + d_{1,1}$    (40.2-3c)
    $c_3 = d_{1,2} - d_{1,1} - d_{2,2}$    (40.2-3d)
    $c_4 = d_{2,2}$    (40.2-3e)

where

    $d_{0,0} = a_0\,b_0$    (40.2-4a)
    $d_{1,1} = a_1\,b_1$    (40.2-4b)
    $d_{2,2} = a_2\,b_2$    (40.2-4c)
    $d_{0,1} = (a_0 + a_1)\,(b_0 + b_1)$    (40.2-4d)
    $d_{0,2} = (a_0 + a_2)\,(b_0 + b_2)$    (40.2-4e)
    $d_{1,2} = (a_1 + a_2)\,(b_1 + b_2)$    (40.2-4f)

The scheme involves 6 multiplications and 13 additions. Recursive application leads to the asymptotic
cost $O(N^{\log_3(6)}) \approx O(N^{1.6309})$ which is slightly worse than for the 2-term scheme. However, applying this
scheme first for a polynomial with $N = 3 \cdot 2^n$ terms and then using the Karatsuba scheme recursively
can be advantageous.
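
The 3-way scheme can be checked against the schoolbook product. The following is a minimal sketch with integer coefficients (the scheme works over any ring where these operations make sense); the names poly3, poly5, ka3_mult and schoolbook3_mult are invented for this illustration.

```cpp
#include <array>

using poly3 = std::array<long, 3>;  // a0 + a1 x + a2 x^2
using poly5 = std::array<long, 5>;  // c0 + c1 x + ... + c4 x^4

// Product with 6 multiplications, relations 40.2-3a..40.2-4f.
inline poly5 ka3_mult(const poly3 &a, const poly3 &b)
{
    const long d00 = a[0] * b[0];
    const long d11 = a[1] * b[1];
    const long d22 = a[2] * b[2];
    const long d01 = (a[0] + a[1]) * (b[0] + b[1]);
    const long d02 = (a[0] + a[2]) * (b[0] + b[2]);
    const long d12 = (a[1] + a[2]) * (b[1] + b[2]);
    return { d00,
             d01 - d00 - d11,
             d02 - d00 - d22 + d11,
             d12 - d11 - d22,
             d22 };
}

// Schoolbook product with 9 multiplications, for comparison.
inline poly5 schoolbook3_mult(const poly3 &a, const poly3 &b)
{
    poly5 c = { 0, 0, 0, 0, 0 };
    for (int s = 0; s < 3; ++s)
        for (int t = 0; t < 3; ++t)
            c[s + t] += a[s] * b[t];
    return c;
}
```

For example, (1 + 2x + 3x^2)(4 + 5x + 6x^2) = 4 + 13x + 28x^2 + 27x^3 + 18x^4 with both routines. For binary polynomials the additions and subtractions are all XORs.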


We generalize the method for n-term polynomials and denote the scheme by KA-n. The 2-term scheme
KA-2 is the Karatsuba algorithm. With

    $A = \sum_{k=0}^{n-1} a_k\,x^k$    (40.2-5a)
    $B = \sum_{k=0}^{n-1} b_k\,x^k$    (40.2-5b)
    $C = A\,B =: \sum_{k=0}^{2n-2} c_k\,x^k$    (40.2-5c)

define

    $d_{s,s} := a_s\,b_s$  for $s = 0, 1, \ldots, n-1$    (40.2-6a)
    $d_{s,t} := (a_s + a_t)\,(b_s + b_t)$  for $s + t = i$, $t > s \geq 0$, $1 \leq i \leq 2n-3$    (40.2-6b)
    $c_i^* := \sum_{s+t=i,\; 0 \leq s < t \leq n-1} d_{s,t} \;-\; \sum_{s+t=i,\; 0 \leq s < t \leq n-1} (d_{s,s} + d_{t,t})$    (40.2-6c)

Then

    $c_0 = d_{0,0}$    (40.2-7a)
    $c_{2n-2} = d_{n-1,n-1}$    (40.2-7b)

and for $0 < i < 2n-2$:

    $c_i = c_i^*$  if $i$ is odd,  $c_i = c_i^* + d_{i/2,\,i/2}$  otherwise    (40.2-7c)

The Karatsuba scheme is obtained for n = 2.
We give GP code whose output is the KA-n algorithm for given n. We need to create symbols 'ak' (for
$a_k$), 'bk', and so on:


    fa(k)=eval(Str("a" k))
    fb(k)=eval(Str("b" k))
    fc(k)=eval(Str("c" k))
    fd(k,j)=eval(Str("d" k "" j))

For example, we can create a symbolic polynomial of degree 3:

    ? sum(k=0,3, fa(k) * x^k)
    a3*x^3 + a2*x^2 + a1*x + a0


The next routine generates the definitions of all $d_{s,t}$. It returns the number of multiplications involved:

    D(n)=
    {
        local(mct);
        mct = 0;  \\ count multiplications
        for (i=0, n-1, mct+=1;  print(fd(i,i), " = ", fa(i), " * ", fb(i) ) );
        for (t=1, n-1,
            for (s=0, t-1,
                mct += 1;
                print(fd(s,t), " = (", fa(s)+fa(t), ") * (", fb(s)+fb(t), ")" ) ;
            );
        );
        return(mct);
    }

For n = 3 the output is

    d00 = a0 * b0
    d11 = a1 * b1
    d22 = a2 * b2
    d01 = (a0 + a1) * (b0 + b1)
    d02 = (a0 + a2) * (b0 + b2)
    d12 = (a1 + a2) * (b1 + b2)


The following routine prints c_i, the coefficient of the product, in terms of several d_{s,t}. It returns the
number of additions involved:

    C(i, n)=
    {
        local(N, s, act);
        act = -1;  \\ count additions
        print1(fc(i), " = ");
        for (s=0, i-1,
            t = i - s;
            if ( (t>s) && (t<n),
                act += 3;
                print1(" + ", fd(s,t));
                print1(" - ", fd(s,s));
                print1(" - ", fd(t,t));
            );
        );
        if ( 0==i%2,  act+=1;  print1(" + ", fd(i/2,i/2)) );
        print();
        return( act );
    }


It has to be called for all i where 0 \le i \le 2n-2. The algorithm is generated by the following routine:

    KA(n)=
    {
        local(mct, act);
        act = 0;  \\ count additions
        mct = 0;  \\ count multiplications
        mct = D(n);  \\ generate definitions for the d_{s,t}
        \\ generate rules for computation of c_i in terms of d_{s,t}:
        for (i=0, 2*n-2, act+=C(i,n) );
        act += n*(n-1);  \\ additions when setting up d(i,j) for i!=j
        return( [mct, act] );
    }

With n = 3 we find the relations (40.2-3a) ... (40.2-3e):

    c0 = + d00
    c1 = + d01 - d00 - d11
    c2 = + d02 - d00 - d22 + d11
    c3 = + d12 - d11 - d22
    c4 = + d22
    d00 = a0 * b0
    d11 = a1 * b1
    d22 = a2 * b2
    d33 = a3 * b3
    d44 = a4 * b4
    d01 = (a0 + a1) * (b0 + b1)
    d02 = (a0 + a2) * (b0 + b2)
    d12 = (a1 + a2) * (b1 + b2)
    d03 = (a0 + a3) * (b0 + b3)
    d13 = (a1 + a3) * (b1 + b3)
    d23 = (a2 + a3) * (b2 + b3)
    d04 = (a0 + a4) * (b0 + b4)
    d14 = (a1 + a4) * (b1 + b4)
    d24 = (a2 + a4) * (b2 + b4)
    d34 = (a3 + a4) * (b3 + b4)
    c0 = + d00
    c1 = + d01 - d00 - d11
    c2 = + d02 - d00 - d22 + d11
    c3 = + d03 - d00 - d33 + d12 - d11 - d22
    c4 = + d04 - d00 - d44 + d13 - d11 - d33 + d22
    c5 = + d14 - d11 - d44 + d23 - d22 - d33
    c6 = + d24 - d22 - d44 + d33
    c7 = + d34 - d33 - d44
    c8 = + d44

Figure 40.2-A: Code for the algorithm KA-5.


Now we generate the definitions for the KA-5 algorithm: 


n=5 /* n terms, degree=n-1 */ 
default(echo, 0); 
KA(n) ; 


We obtain the algorithm KA-5 shown in figure 40.2-A. The format is valid GP input, so we add a few
lines that print code to check the algorithm:

    print("A=",sum(k=0,n-1, fa(k) * x^k))
    print("B=",sum(k=0,n-1, fb(k) * x^k))
    print("/* direct computation of the product: */")
    print("C=A*B")
    print("/* Karatsuba computation of the product: */")
    print("K=",sum(k=0,2*n-2, fc(k) * x^k))
    print("qq=K-C")
    print("print( if(0==qq, \"OK.\", \" **** OUCH!\") )")

This gives for n = 5:

    A=a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
    B=b4*x^4 + b3*x^3 + b2*x^2 + b1*x + b0
    /* direct computation of the product: */
    C=A*B
    /* Karatsuba computation of the product: */
    K=c8*x^8 + c7*x^7 + c6*x^6 + c5*x^5 + c4*x^4 + c3*x^3 + c2*x^2 + c1*x + c0
    qq=K-C
    print( if(0==qq, "OK.", " **** OUCH!") )


We can feed the output into another GP session to verify the algorithm: 
gp -f -q < karatsuba-n.gp | gp 
The output is (shortened and comments added) 


    /* definitions of d(s,t): */
    b4*a4
    (b0 + b1)*a0 + (a1*b0 + b1*a1)
    (b0 + b2)*a0 + (a2*b0 + b2*a2)
    [--snip--]
    (b3 + b4)*a3 + (a4*b3 + b4*a4)

    /* the c_i in terms of d(s,t), evaluated: */
    b0*a0
    b1*a0 + a1*b0
    b2*a0 + (a2*b0 + b1*a1)
    b3*a0 + (a3*b0 + (b2*a1 + a2*b1))
    b4*a0 + (a4*b0 + (b3*a1 + (a3*b1 + b2*a2)))
    b4*a1 + (a4*b1 + (b3*a2 + a3*b2))
    b4*a2 + (a4*b2 + b3*a3)
    b4*a3 + a4*b3
    b4*a4

    /* polynomials: */
    a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
    b4*x^4 + b3*x^3 + b2*x^2 + b1*x + b0

    /* direct computation of product: */
    b4*a4*x^8 + (b4*a3 + a4*b3)*x^7 + (b4*a2 + (a4*b2 + b3*a3))*x^6 + [...]

    /* Karatsuba computation of product: */
    b4*a4*x^8 + (b4*a3 + a4*b3)*x^7 + (b4*a2 + (a4*b2 + b3*a3))*x^6 + [...]

    /* the difference: */

    OK.  /* looks good */


The number of multiplications with the KA-n splitting scheme is (n^2 + n)/2, which is suboptimal except
for n = 2. However, recursive application can be worthwhile. One should start with the biggest prime
factors as the number of additions is then minimized. The number of multiplications does not depend
on the order of recursion (see [347] which also tabulates the number of additions and multiplications for
n < 128).

With n just below a highly composite number one may append zeros as ‘dummy’ terms and recursively
use KA-n algorithms for small n. For example, for polynomials of degree 63 the recursion with KA-2
(and n = 64) will beat the scheme “KA-7, then KA-3”.
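
The comparison can be made concrete by counting multiplications. The sketch below assumes that a recursion over the factors n_1, n_2, ... of the number of terms uses the product of the per-step counts (n_i^2 + n_i)/2, and that the scheme for 63 terms means KA-7 followed by two levels of KA-3. The function names are hypothetical, they are not part of FXT.

```cpp
#include <cassert>
#include <vector>

// Multiplications of one KA-n step: (n*n + n)/2.
unsigned long ka_step_mults(unsigned long n)
{
    return (n * n + n) / 2;
}

// Multiplications when KA-n1, KA-n2, ... are applied recursively
// to polynomials with n1*n2*... terms: the product of the step counts.
// The result does not depend on the order of the factors.
unsigned long ka_recursive_mults(const std::vector<unsigned long> &factors)
{
    unsigned long m = 1;
    for (unsigned long n : factors)  m *= ka_step_mults(n);
    return m;
}
```

Under these assumptions, six levels of KA-2 (64 terms) need 3^6 = 729 multiplications, while KA-7 followed by KA-3 and KA-3 (63 terms) needs 28 * 6 * 6 = 1008.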


One can write code generators that create expanded versions of the recursions for n the product of 
small primes. If the cost of multiplication is much higher than for addition (as for binary polynomial 
multiplication on general purpose CPUs), then substantial savings can be expected. 


40.2.3 Toom-Cook algorithms for binary polynomials 


The 3-way and 4-way splitting schemes described in section on page cannot be used with binary 
polynomials because constants other than 0 and 1 are used. We now give splitting schemes that use the 
constants 0 and 1 only, they are given in [59]. The schemes are valid only for binary polynomials. 


40.2.3.1 3-way splitting 
For the multiplication of two polynomials A and B both of degree 3N write

A = a_0 + a_1 x^N + a_2 x^{2N} =: a_0 + a_1 Y + a_2 Y^2    (40.2-8)

and identically for B. A 3-way splitting scheme for multiplication is shown in figure 40.2-B. The
multiplications and divisions by x are shifts and the exact divisions are linear operations if we use the method
of section 40.1.6 on page 826.

40.2.3.2 4-way splitting
For the multiplication of two polynomials A and B both of degree 4N write

A = a_0 + a_1 Y + a_2 Y^2 + a_3 Y^3    (40.2-9)

where Y := x^N and identically for B. The 4-way splitting multiplication scheme is shown in figure 40.2-C.

    A = a2*Y^2 + a1*Y + a0
    B = b2*Y^2 + b1*Y + b0
    S3 = a2 + a1 + a0;
    S2 = b2 + b1 + b0;
    S1 = S3 * S2;  \\ Mult (1)
    S0 = a2*x^2 + a1*x;
    S4 = b2*x^2 + b1*x;
    S3 += S0;
    S2 += S4;
    S0 += a0;
    S4 += b0;
    S3 *= S2;  \\ Mult (2)
    S2 = S0 * S4;  \\ Mult (3)
    S4 = a2 * b2;  \\ Mult (4)
    S0 = a0 * b0;  \\ Mult (5)
    S3 += S2;
    S2 += S0;  S2 /= x;  S2 += S3;
    T = S4;  T *= (x^3+1);  \\ temporary variable
    S2 += T;  S2 /= (x+1);  \\ exact division
    S1 += S0;
    S3 += S1;  S3 /= x;  S3 /= (x+1);  \\ exact division
    S1 += S4;  S1 += S2;
    S2 += S3;
    P = S4*Y^4 + S3*Y^3 + S2*Y^2 + S1*Y + S0;
    Mod(1,2)*(P - A*B)  \\ == zero
    (P - A*B)  \\ NOT zero, the scheme only works over GF(2)


Figure 40.2-B: Implementation of the 3-way multiplication scheme for binary polynomials. The five 
expensive multiplications are commented with ‘Mult (n)’. 


40.2.4 FFT based methods 


For polynomials of very high degree FFT-based algorithms can be used. The simplest method is to use 
integer multiplication without the carry phase (which is polynomial multiplication!). We give an example 
using decimal digits. The carry phase of the integer multiplication is replaced by a reduction modulo 2: 


100110111 * 110101 
== 11022223331211 // integer multiplication 


== 11000001111011 // parity of digits 


The scheme will work for polynomials of degree less than nine only. When using an FFT multiplication
scheme (see section 8.2 on page 558), we can multiply polynomials up to degree N as long as the integer
values 0, 1, 2, ..., N+1 can be distinguished after computing the FFT. This is hardly a limitation at all:
with the C type float (24-bit mantissa) polynomials up to degree one million can be multiplied assuming
at least 20 bits are correct after the FFT. With type double (53-bit mantissa) there is no practical limit.
While the algorithm is very easy to implement, it is not competitive with well-implemented splitting schemes
and the FFT method described in [303] or the multiplication algorithm given in [92]. An excellent source
for multiplication algorithms for binary polynomials is [85].
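
The idea of "integer multiplication without carries, then parity" can be sketched as follows. No FFT is used here: a plain double loop stands in for the convolution that an FFT would compute, and the function names are made up for this illustration. The result is compared with the usual shift-and-XOR product.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Coefficients (bits) of a binary polynomial as a vector of small integers.
std::vector<unsigned> bits_of(std::uint64_t w)
{
    std::vector<unsigned> v;
    for ( ; w != 0; w >>= 1)  v.push_back((unsigned)(w & 1));
    return v;
}

// Multiply as integer sequences (acyclic convolution, no carries),
// then reduce every entry modulo 2.
std::uint64_t mult_via_convolution(std::uint64_t a, std::uint64_t b)
{
    const std::vector<unsigned> va = bits_of(a), vb = bits_of(b);
    if (va.empty() || vb.empty())  return 0;
    std::vector<unsigned> vc(va.size() + vb.size() - 1);
    for (std::size_t i = 0; i < va.size(); ++i)
        for (std::size_t j = 0; j < vb.size(); ++j)
            vc[i + j] += va[i] * vb[j];  // an FFT would do this step
    std::uint64_t c = 0;
    for (std::size_t k = 0; k < vc.size(); ++k)
        c |= (std::uint64_t)(vc[k] & 1) << k;  // parity of each entry
    return c;
}

// Reference: shift-and-XOR multiplication of binary polynomials.
std::uint64_t mult_shift_xor(std::uint64_t a, std::uint64_t b)
{
    std::uint64_t t = 0;
    for ( ; b != 0; b >>= 1, a <<= 1)
        if (b & 1)  t ^= a;
    return t;
}
```

With the operands of the decimal example, 100110111 and 110101 read as bit strings (0x137 and 0x35), both routines give the bit string 11000001111011 (0x307B). The product must fit into 64 bits.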


40.3 Modular arithmetic with binary polynomials 


Here we consider arithmetic of binary polynomials modulo a binary polynomial. Addition and subtraction 
are again the XOR operation and no modular reduction is required. 


    A = a3*Y^3 + a2*Y^2 + a1*Y + a0;
    B = b3*Y^3 + b2*Y^2 + b1*Y + b0;
    S1 = a3 + a2 + a1 + a0;
    S2 = b3 + b2 + b1 + b0;
    S3 = S1 * S2;  \\ Mult (1)
    S0 = a1 + x*(a2 + x*a3);
    S6 = b1 + x*(b2 + x*b3);
    S4 = (S0 + a3*(x+1))*x + S1;
    S5 = (S6 + b3*(x+1))*x + S2;
    S0 = S0*x + a0;
    S6 = S6*x + b0;
    S5 = S5 * S4;  \\ Mult (2)
    S4 = S0 * S6;  \\ Mult (3)
    S0 = a0*x^3 + a1*x^2 + a2*x;
    S6 = b0*x^3 + b1*x^2 + b2*x;
    S1 = S1 + S0 + a0*(x^2+x);
    S2 = S2 + S6 + b0*(x^2+x);
    S0 = S0 + a3;
    S6 = S6 + b3;
    S1 = S1 * S2;  \\ Mult (4)
    S2 = S0 * S6;  \\ Mult (5)
    S6 = a3 * b3;  \\ Mult (6)
    S0 = a0 * b0;  \\ Mult (7)
    S1 = S1 + S2 + S0*(x^4+x^2+1);
    S5 = (S5 + S4 + S6*(x^4+x^2+1) + S1) \ (x^4+x);
    S2 = S2 + S6 + S0*x^6;
    S4 = S4 + S2 + S6*x^6 + S0;
    S4 = (S4 + S5*(x^5+x)) \ (x^4+x^2);
    S3 = S3 + S0 + S6;
    S1 = S1 + S3;
    S2 = S2 + S1*x + S3*x^2;
    S3 = S3 + S4 + S5;
    S1 = (S1 + S3*(x^2+x)) \ (x^4+x);
    S5 = S5 + S1;
    S2 = (S2 + S5*(x^2+x)) \ (x^4+x^2);
    S4 = S4 + S2;
    P = S6*Y^6 + S5*Y^5 + S4*Y^4 + S3*Y^3 + S2*Y^2 + S1*Y + S0;
    Mod(1,2)*(P - A*B)  \\ == zero
    (P - A*B)  \\ NOT zero, the scheme only works over GF(2)

Figure 40.2-C: Implementation of the 4-way multiplication scheme for binary polynomials. The seven
expensive multiplications are commented with ‘Mult (n)’.

40.3.1 Multiplication and squaring
Multiplication of a polynomial A by x modulo (a polynomial) C can be done by shifting left and
subtracting C if the coefficient shifted out is one [FXT: bpol/bitpolmod-arith.h]:

    static inline ulong bitpolmod_times_x(ulong a, ulong c, ulong h)
    // Return (A * x) mod C
    // where A and C represent polynomials over Z/2Z:
    //   W = pol(w) =: \sum_k{ [bit_k(w)] * x^k}
    // h needs to be a mask with one bit set:
    //  h == highest_one(c) >> 1  ==  1UL << (degree(C)-1)
    {
        ulong s = a & h;
        a <<= 1;
        if ( s )  a ^= c;
        return a;
    }


To avoid the repeated computation of the highest set bit, we introduced the auxiliary variable h that has
to be initialized as described in the comment. Section 1.6 on page 14 gives algorithms for the function
highest_one(). Note that h needs to be recomputed only if the degree of the modulus C changes, which
is usually only once for a series of calculations. By using the variable h we can use the routine even if
the degree of C equals the number of bits in a word, in which case C does not fit into a word.

The routine for the multiplication of two polynomials a and b modulo C is obtained by adding a reduction
step to the binary multiplication routine:


    inline ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
    // Return (A * B) mod C
    {
        ulong t = 0;
        while ( b )
        {
            if ( b & 1 )  t ^= a;
            b >>= 1;
            ulong s = a & h;
            a <<= 1;
            if ( s )  a ^= c;
        }
        return t;
    }
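
A small worked example may help to see the role of c and h. The sketch repeats the two routines in self-contained form and fixes an example modulus C = x^4 + x + 1; the modulus and the names `ex_c` and `ex_h` are choices made for this illustration only.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Return (A * x) mod C,  h == 1UL << (degree(C)-1)
static inline ulong bitpolmod_times_x(ulong a, ulong c, ulong h)
{
    ulong s = a & h;   // coefficient that will be shifted out
    a <<= 1;
    if ( s )  a ^= c;  // subtract (XOR) the modulus
    return a;
}

// Return (A * B) mod C
static inline ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        a = bitpolmod_times_x(a, c, h);
    }
    return t;
}

// Example modulus (an assumption of this sketch): C = x^4 + x + 1
static const ulong ex_c = 0x13;  // binary 10011
static const ulong ex_h = 0x8;   // 1UL << (4-1)
```

Here x^3 * x = x^4 = x + 1 (the word 3), and (x^2 + x) * (x^2 + x + 1) = x^4 + x = 1 modulo C.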


40.3.2 Optimization of the squaring and multiplication routines 


Squaring a can be done by the multiplication a*a. If many squarings have to be done with a fixed modulus,
then the optimization using a precomputed table of the residues x^{2k} mod C (see page 886) can be useful.
Squaring of the polynomial \sum_k a_k x^k is the computation of the sum \sum_k a_k x^{2k}
modulo C. We use the auxiliary function

    static inline ulong bitpolmod_times_x2(ulong a, ulong c, ulong h)
    // Return (A * x * x) mod C
    {
        { ulong s = a & h;  a <<= 1;  if ( s )  a ^= c; }
        { ulong s = a & h;  a <<= 1;  if ( s )  a ^= c; }
        return a;
    }

The squaring function, with a 4-fold unrolled loop, is

    static inline ulong bitpolmod_square(ulong a, ulong c, ulong h)
    // Return A*A mod C
    {
        ulong t = 0,  s = 1;
        do
        {
            if ( a&1 )  t ^= s;   a >>= 1;  s = bitpolmod_times_x2(s, c, h);
            if ( a&1 )  t ^= s;   a >>= 1;  s = bitpolmod_times_x2(s, c, h);
            if ( a&1 )  t ^= s;   a >>= 1;  s = bitpolmod_times_x2(s, c, h);
            if ( a&1 )  t ^= s;   a >>= 1;  s = bitpolmod_times_x2(s, c, h);
        }
        while ( a );
        return t;
    }


Whether the unrolled code is used can be specified via the line

    #define MULT_UNROLL  // define to unroll loops 4-fold

The optimization used for the multiplication routine is also unrolling as described in section 40.1.2:

    static inline ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
    {
        ulong t = 0;
        do
        {
            { if ( b&1 )  t ^= a;  b >>= 1;  ulong s = a & h;  a <<= 1;  if ( s )  a ^= c; }
            { if ( b&1 )  t ^= a;  b >>= 1;  ulong s = a & h;  a <<= 1;  if ( s )  a ^= c; }
            { if ( b&1 )  t ^= a;  b >>= 1;  ulong s = a & h;  a <<= 1;  if ( s )  a ^= c; }
            { if ( b&1 )  t ^= a;  b >>= 1;  ulong s = a & h;  a <<= 1;  if ( s )  a ^= c; }
        }
        while ( b );
        return t;
    }


It turns out that squaring via multiplication is slightly faster than via the described sum computation. 


40.3.3 Exponentiation 


The following routine for modular exponentiation uses the right-to-left powering algorithm from
section 28.5.1 on page 563 [FXT: bpol/bitpolmod-arith.h]:

    inline ulong bitpolmod_power(ulong a, ulong e, ulong c, ulong h)
    // Return (A ** e) mod C
    {
        if ( 0==e )  return 1;  // avoid hang with e==0 in next while()

        ulong s = a;
        while ( 0==(e&1) )
        {
            s = bitpolmod_square(s, c, h);
            e >>= 1;
        }

        a = s;
        while ( 0!=(e>>=1) )
        {
            s = bitpolmod_square(s, c, h);
            if ( e & 1 )  a = bitpolmod_mult(a, s, c, h);
        }
        return a;
    }


The left-to-right powering algorithm given in section 28.5.2 on page 564 can be implemented as:

    inline ulong bitpolmod_power(ulong a, ulong e, ulong c, ulong h)
    {
        ulong s = a;
        ulong b = highest_one(e);
        while ( b>1 )
        {
            b >>= 1;
            s = bitpolmod_square(s, c, h);  // s *= s;
            if ( e & b )  s = bitpolmod_mult(s, a, c, h);  // s *= a;
        }
        return s;
    }


Computing a power of x can be optimized with this scheme:

    inline ulong bitpolmod_xpower(ulong e, ulong c, ulong h)
    // Return (x ** e) mod C
    {
        ulong s = 2;  // 'x'
        ulong b = highest_one(e);
        while ( b>1 )
        {
            b >>= 1;
            s = bitpolmod_square(s, c, h);  // s *= s;
            if ( e & b )  s = bitpolmod_times_x(s, c, h);  // s *= x;
        }
        return s;
    }
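
The scheme can be tried out with a self-contained sketch. It assumes the example modulus C = x^4 + x + 1, squares via the multiplication routine, and uses a simple loop for highest_one(); these are simplifications for illustration, not the FXT versions. The exponent must be at least 1.

```cpp
#include <cassert>

typedef unsigned long ulong;

static ulong bitpolmod_times_x(ulong a, ulong c, ulong h)
{
    ulong s = a & h;
    a <<= 1;
    if ( s )  a ^= c;
    return a;
}

static ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        a = bitpolmod_times_x(a, c, h);
    }
    return t;
}

// Word with only the highest set bit of x, zero for x == 0 (simple loop version).
static ulong highest_one(ulong x)
{
    if ( x == 0 )  return 0;
    ulong b = 1;
    while ( x >>= 1 )  b <<= 1;
    return b;
}

// Return (x ** e) mod C for e >= 1, scanning the bits of e from the left.
static ulong bitpolmod_xpower(ulong e, ulong c, ulong h)
{
    ulong s = 2;  // 'x'
    ulong b = highest_one(e);
    while ( b > 1 )
    {
        b >>= 1;
        s = bitpolmod_mult(s, s, c, h);                  // s *= s;
        if ( e & b )  s = bitpolmod_times_x(s, c, h);    // s *= x;
    }
    return s;
}
```

Modulo x^4 + x + 1 (c = 0x13, h = 0x8) this gives x^4 = x + 1, x^7 = x^3 + x + 1, and x^15 = 1.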


40.3.4 Division by x 


Division by x is possible if the modulus has a nonzero constant term (that is, gcd(C, x) = 1). The routine
is quite simple [FXT: bpol/bitpolmod-arith.h]:

    static inline ulong bitpolmod_div_x(ulong a, ulong c, ulong h)
    // Return (A / x) mod C
    // C must have nonzero constant term: (c&1)==1
    {
        ulong s = a & 1;
        a >>= 1;
        if ( s )
        {
            a ^= (c>>1);
            a |= h;  // so it also works for n == BITS_PER_LONG
        }
        return a;
    }

If we do not insist on correct results for the case that the degree of C equals the number of bits in a
word, we could simply use the following two-liner:

    if ( a&1 )  a ^= c;
    a >>= 1;

The operation needs only about two CPU cycles. The inverse of x can be computed with:

    static inline ulong bitpolmod_inv_x(ulong c, ulong h)
    // Return (1 / x) mod C
    // C must have nonzero constant term: (c&1)==1
    {
        ulong a = (c>>1);
        a |= h;  // so it also works for n == BITS_PER_LONG
        return a;
    }
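
That division by x undoes multiplication by x can be checked exhaustively for a small modulus. The sketch below restates the routines and assumes the example modulus C = x^4 + x + 1; the helper `div_x_roundtrip_ok()` is hypothetical and exists only for this check.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Return (A * x) mod C
static ulong bitpolmod_times_x(ulong a, ulong c, ulong h)
{
    ulong s = a & h;
    a <<= 1;
    if ( s )  a ^= c;
    return a;
}

// Return (A / x) mod C;  C must have constant term 1
static ulong bitpolmod_div_x(ulong a, ulong c, ulong h)
{
    ulong s = a & 1;
    a >>= 1;
    if ( s )  { a ^= (c>>1);  a |= h; }
    return a;
}

// Return (1 / x) mod C
static ulong bitpolmod_inv_x(ulong c, ulong h)
{
    return (c>>1) | h;
}

// Check both round trips for all residues modulo C = x^4 + x + 1.
static bool div_x_roundtrip_ok()
{
    const ulong c = 0x13, h = 0x8;
    for (ulong a = 0; a < 16; ++a)
    {
        if ( bitpolmod_div_x( bitpolmod_times_x(a, c, h), c, h ) != a )  return false;
        if ( bitpolmod_times_x( bitpolmod_div_x(a, c, h), c, h ) != a )  return false;
    }
    return true;
}
```

For this modulus the inverse of x is x^3 + 1 (the word 9), because x * (x^3 + 1) = x^4 + x = 1 modulo C.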


40.3.5 Inversion and division 


The method to compute the extended GCD (EGCD) is the same as given in section 39.1.4
[FXT: bpol/bitpol-gcd.h]:

    inline ulong bitpol_egcd(ulong u, ulong v, ulong &iu, ulong &iv)
    // Return u3 and set iu,iv so that  gcd(u,v) == u3 == u*iu + v*iv
    {
        ulong u1 = 1,  u2 = 0;
        ulong v1 = 0,  v3 = v;
        ulong u3 = u,  v2 = 1;
        while ( v3!=0 )
        {
            ulong q = bitpol_div(u3, v3);  // == u3 / v3;

            ulong t1 = u1 ^ bitpol_mult(v1, q);  // == u1 - v1 * q;
            u1 = v1;  v1 = t1;

            ulong t3 = u3 ^ bitpol_mult(v3, q);  // == u3 - v3 * q;
            u3 = v3;  v3 = t3;

            ulong t2 = u2 ^ bitpol_mult(v2, q);  // == u2 - v2 * q;
            u2 = v2;  v2 = t2;
        }
        iu = u1;  iv = u2;
        return u3;
    }

The routine can be optimized using bitpol_divrem(): remove the lines

    ulong q = bitpol_div(u3, v3);  // == u3 / v3;
    [--snip--]
    ulong t3 = u3 ^ bitpol_mult(v3, q);  // == u3 - v3 * q;

and insert at the beginning of the body of the loop:

    ulong q, t3;
    bitpol_divrem(u3, v3, q, t3);


The routine computes the GCD g and two additional quantities i_u and i_v so that

g = u \cdot i_u + v \cdot i_v

If g = 1, we have

1 \equiv u \cdot i_u \bmod v

That is, i_u is the inverse of u modulo v. The implementation is [FXT: bpol/bitpolmod-arith.h]:

    inline ulong bitpolmod_inverse(ulong a, ulong c)
    // Returns the inverse of A modulo C if it exists, else zero.
    // Must have deg(A) < deg(C)
    {
        ulong i, t;  // t unused
        ulong g = bitpol_egcd(a, c, i, t);
        if ( g!=1 )  i = 0;
        return i;
    }

Modular division is done by multiplication with the inverse:

    inline ulong bitpolmod_divide(ulong a, ulong b, ulong c, ulong h)
    // Return a/b modulo c.
    // Must have: gcd(b,c)==1
    {
        ulong i = bitpolmod_inverse(b, c);
        a = bitpolmod_mult(a, i, c, h);
        return a;
    }
The inverse of a number a modulo a prime m can be computed as a^{-1} = a^{m-2} (m - 1 is the maximal
order of an element in Z/mZ). With an irreducible (see section 40.4) polynomial C of degree n the inverse
modulo C of a polynomial A can be computed as A^{-1} = A^{2^n - 2} (2^n - 1 is the maximal order modulo C,
see section 40.5 on page 841):

    inline ulong bitpolmod_inverse_irred(ulong a, ulong c, ulong h)
    // Return (A ** -1) mod C
    // Must have: C irreducible.
    {
        ulong r1 = (h<<1) - 2;  // max order minus one
        ulong i = bitpolmod_power(a, r1, c, h);
        return i;
    }
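
Inversion by powering can be verified for every nonzero residue of a small field. The sketch assumes the irreducible example modulus C = x^4 + x + 1 and uses a plain right-to-left powering loop in place of the FXT routine; `all_inverses_ok()` is a hypothetical helper for the check.

```cpp
#include <cassert>

typedef unsigned long ulong;

static ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        ulong s = a & h;
        a <<= 1;
        if ( s )  a ^= c;
    }
    return t;
}

// Return (A ** e) mod C, right-to-left binary powering.
static ulong bitpolmod_power(ulong a, ulong e, ulong c, ulong h)
{
    ulong r = 1;
    while ( e )
    {
        if ( e & 1 )  r = bitpolmod_mult(r, a, c, h);
        a = bitpolmod_mult(a, a, c, h);
        e >>= 1;
    }
    return r;
}

// Return (A ** -1) mod C for irreducible C, as A ** (2^n - 2).
static ulong bitpolmod_inverse_irred(ulong a, ulong c, ulong h)
{
    ulong r1 = (h<<1) - 2;  // max order minus one
    return bitpolmod_power(a, r1, c, h);
}

// Check A * A^-1 == 1 for all nonzero A modulo C = x^4 + x + 1.
static bool all_inverses_ok()
{
    const ulong c = 0x13, h = 0x8;
    for (ulong a = 1; a < 16; ++a)
        if ( bitpolmod_mult(a, bitpolmod_inverse_irred(a, c, h), c, h) != 1 )
            return false;
    return true;
}
```

For n = 4 the exponent is 2^4 - 2 = 14, and the inverse of x comes out as x^3 + 1 (the word 9).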


40.4 Irreducible polynomials 


A polynomial is called irreducible if it has no nontrivial factors (trivial factors are the constant polyno- 
mial ‘1’ and the polynomial itself). A polynomial that has a nontrivial factorization is called reducible. 
The irreducible polynomials are the ‘primes’ among the polynomials. 


The factorization of a polynomial depends on its coefficient field: The polynomial x^2 + 1 over R (or Z)
is irreducible. Over C it factors as (x^2 + 1) = (x + i) (x - i). As a binary polynomial, the factorization
is (x^2 + 1) = (x + 1)^2.

All polynomials with zero constant coefficient (except x) are reducible because they have the factor x. A 
binary polynomial that is irreducible has at least one nonzero coefficient of odd degree (else it would be 
a square). All binary polynomials except for x + 1 that have an even number of nonzero coefficients are 
reducible because they have the factor x + 1. 


40.4.1 Testing for irreducibility 


Irreducibility tests for binary polynomials use the fact that the polynomial x^{2^n} - x = x^{2^n} + x has all
irreducible polynomials whose degrees divide n as factors. For example, with n = 6 we get

x^{2^6} + x = x^{64} + x    (40.4-1a)
  = (x) (x + 1)    (40.4-1b)
    (x^2 + x + 1) (x^3 + x + 1) (x^3 + x^2 + 1)
    (x^6 + x + 1) (x^6 + x^3 + 1) (x^6 + x^4 + x^2 + x + 1)
    (x^6 + x^4 + x^3 + x + 1) (x^6 + x^5 + 1) (x^6 + x^5 + x^2 + x + 1)
    (x^6 + x^5 + x^3 + x^2 + 1) (x^6 + x^5 + x^4 + x + 1) (x^6 + x^5 + x^4 + x^2 + 1)


40.4.1.1 The Ben-Or test for irreducibility 


A binary polynomial C of degree d is reducible if gcd(x^{2^k} - x mod C, C) \ne 1 for any k < d. We compute
u_k = x^{2^k} (modulo C) for each k < d by successive squarings and test whether gcd(u_k + x, C) = 1 for
all k. But as a factor of degree f implies another one of degree d - f it suffices to do the first \lfloor d/2 \rfloor of the
tests. The algorithm is called the Ben-Or irreducibility test. A C++ implementation is given in [FXT:
bpol/bitpol-irred-ben-or.cc]:


    bool bitpol_irreducible_q(ulong c, ulong h)
    // Return whether C is irreducible (via the Ben-Or irreducibility test);
    // h needs to be a mask with one bit set:
    //  h == highest_one(C) >> 1  ==  1UL << (degree(C)-1)
    {
        if ( c<4 )  // C is one of 0, 1, x, 1+x
        {
            if ( c>=2 )  return true;   // x, and 1+x are irreducible
            else         return false;  // constant polynomials are reducible
        }

        if ( 0==(1&c) )  return false;  // x is a factor

        // if ( 0==(c & 0xaaaaaaaaUL ) )  return 0;  // at least one odd degree term
        // if ( 0==parity(c) )  return 0;  // need odd number of nonzero coeff.
        // if ( 0!=bitpol_test_squarefree(c) )  return 0;  // must be square-free

        ulong d = h >> 1;
        ulong u = 2;  // =^= x
        while ( 0 != d )  // floor( degree/2 ) times
        {
            // Square r-times for coefficients of c in GF(2^r).
            // We have r==1:
            u = bitpolmod_square(u, c, h);

            ulong upx = u ^ 2;  // =^= u+x

            ulong g = bitpol_binary_gcd(upx, c);

            if ( 1!=g )  return false;  // reducible

            d >>= 2;
        }
        return true;  // irreducible
    }


Commented out at the beginning are a few tests for some necessary conditions for irreducibility. For the
test bitpol_test_squarefree() (for a square factor) see section 40.12.2 on page 860. The routine will
fail if deg(C) = BITS_PER_LONG, because the GCD computation fails in this case.
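
The test can be tried out in a self-contained form. The sketch below is a simplified variant, not the FXT code: it uses a plain Euclidean GCD instead of bitpol_binary_gcd(), computes h from the degree, and squares via multiplication. All names are chosen for this illustration, and the degree must be less than the word size.

```cpp
#include <cassert>

typedef unsigned long ulong;

// Degree of a binary polynomial, -1 for the zero polynomial.
static int bitpol_deg(ulong a)
{
    int d = -1;
    while ( a )  { ++d;  a >>= 1; }
    return d;
}

// GCD by the Euclidean algorithm (plain version).
static ulong bitpol_gcd(ulong a, ulong b)
{
    while ( b != 0 )
    {
        const int db = bitpol_deg(b);
        for (int da = bitpol_deg(a);  da >= db;  da = bitpol_deg(a))
            a ^= (b << (da - db));
        // here a holds the remainder of the division by b
        ulong t = a;  a = b;  b = t;
    }
    return a;
}

static ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        ulong s = a & h;
        a <<= 1;
        if ( s )  a ^= c;
    }
    return t;
}

// Ben-Or test: floor(d/2) squarings, each followed by a GCD.
static bool bitpol_irreducible_ben_or(ulong c)
{
    if ( c < 4 )  return ( c >= 2 );    // x and 1+x are irreducible
    if ( 0 == (c & 1) )  return false;  // x is a factor
    const int d = bitpol_deg(c);
    const ulong h = 1UL << (d - 1);
    ulong u = 2;  // x
    for (int k = 1; k <= d / 2; ++k)
    {
        u = bitpolmod_mult(u, u, c, h);  // u == x^(2^k) mod C
        if ( 1 != bitpol_gcd(u ^ 2UL, c) )  return false;  // reducible
    }
    return true;
}

// Number of irreducible binary polynomials of degree d (by brute force).
static unsigned count_irreducible(int d)
{
    unsigned ct = 0;
    for (ulong c = 1UL << d; c < (2UL << d); ++c)
        if ( bitpol_irreducible_ben_or(c) )  ++ct;
    return ct;
}
```

For degree 6 the count is 9, in agreement with the nine degree-6 factors of x^{64} + x.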
40.4.1.2  Rabin's test for irreducibility 
A binary polynomial C of degree d is irreducible if and only if

x^{2^d} \equiv x \bmod C    (40.4-2a)

and, for all prime divisors p_i of d,

\gcd( x^{2^{d/p_i}} - x \bmod C, C ) = 1    (40.4-2b)

The implied test is called Rabin’s algorithm for irreducibility testing, see [276, p.7]. The number of GCD
computations equals the number of prime divisors of d.

If the prime divisors are processed in decreasing order, the successive exponents are increasing and the
power of x can be updated via squarings. The total number of squarings equals d which is minimal.

A C++ implementation of Rabin's test is given in [FXT: bpol/bitpol-irred-rabin.cc]. A table of auxiliary
bit-masks gives the number of squarings between the GCD computations:


    static const ulong rabin_tab[] =
    {
        0UL,   // x = 0   (bits: ...........)  OPS:
        0UL,   // x = 1   (bits: ...........)  OPS: finally sqr 1 times
        0UL,   // x = 2   (bits: ...........)  OPS: finally sqr 2 times
        0UL,   // x = 3   (bits: ...........)  OPS: finally sqr 3 times
        4UL,   // x = 4   (bits: ........1..)  OPS: sqr 2 times, finally sqr 2 times
        0UL,   // x = 5   (bits: ...........)  OPS: finally sqr 5 times
        12UL,  // x = 6   (bits: .......11..)  OPS: sqr 2 times, sqr 1 times, finally sqr 3 times
        0UL,   // x = 7   (bits: ...........)  OPS: finally sqr 7 times
        16UL,  // x = 8   (bits: ......1....)  OPS: sqr 4 times, finally sqr 4 times
        8UL,   // x = 9   (bits: .......1...)  OPS: sqr 3 times, finally sqr 6 times
        36UL,  // x = 10  (bits: .....1..1..)  OPS: sqr 2 times, sqr 3 times, finally sqr 5 times
        [--snip--]


The GCD computation for the divisor 1 can be avoided by noting that only the polynomial x^2 + x =
(x + 1) x would wrongly pass the test, so we exclude the factor x explicitly. The testing routine is

    inline bool bitpol_irreducible_rabin_q(ulong c, ulong h)
    // Return whether C is irreducible (via Rabin's irreducibility test).
    // h needs to be a mask with one bit set:
    //  h == highest_one(C) >> 1  ==  1UL << (degree(C)-1)
    {
        if ( c<4 )  // C is one of 0, 1, x, 1+x
        {
            if ( c>=2 )  return true;   // x, and 1+x are irreducible
            else         return false;  // constant polynomials are reducible
        }

        if ( 0==(1&c) )  return false;  // x is a factor

        ulong d = 1 + lowest_one_idx(h);  // degree
        ulong rt = rabin_tab[d];
        ulong m = 2;  // =^= 'x'
        while ( rt > 1 )
        {
            do
            {
                --d;
                m = bitpolmod_square(m, c, h);
                rt >>= 1;
            }
            while ( 0 == (rt & 1) );

            ulong g = bitpol_binary_gcd( m ^ 2UL, c );
            if ( g!=1 )  return false;
        }

        do  { m = bitpolmod_square(m, c, h); }  while ( --d );
        if ( m ^ 2UL )  return false;

        return true;
    }


Rabin’s test will be faster than the Ben-Or test if the polynomial is irreducible. If the polynomial is
reducible and has small factors (as is often the case with ‘random’ polynomials), then the Ben-Or test will
be faster. A comparison of the tests is given in [150].
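
Conditions (40.4-2a) and (40.4-2b) can also be coded without the table. The following sketch is a simplified variant under stated assumptions: it finds the exponents d/p_i by trial division instead of using rabin_tab[], uses a plain Euclidean GCD, and does not skip the GCD for the divisor 1. The names are hypothetical.

```cpp
#include <cassert>

typedef unsigned long ulong;

static int bitpol_deg(ulong a)  // degree, -1 for the zero polynomial
{
    int d = -1;
    while ( a )  { ++d;  a >>= 1; }
    return d;
}

static ulong bitpol_gcd(ulong a, ulong b)  // plain Euclidean algorithm
{
    while ( b != 0 )
    {
        const int db = bitpol_deg(b);
        for (int da = bitpol_deg(a);  da >= db;  da = bitpol_deg(a))
            a ^= (b << (da - db));
        ulong t = a;  a = b;  b = t;
    }
    return a;
}

static ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        ulong s = a & h;
        a <<= 1;
        if ( s )  a ^= c;
    }
    return t;
}

static bool is_prime(int n)
{
    if ( n < 2 )  return false;
    for (int q = 2; q * q <= n; ++q)
        if ( n % q == 0 )  return false;
    return true;
}

// Rabin's test: d squarings, one GCD for each prime divisor of d.
static bool bitpol_irreducible_rabin(ulong c)
{
    if ( c < 4 )  return ( c >= 2 );
    if ( 0 == (c & 1) )  return false;  // x is a factor
    const int d = bitpol_deg(c);
    const ulong h = 1UL << (d - 1);
    ulong m = 2;  // x
    for (int k = 1; k < d; ++k)
    {
        m = bitpolmod_mult(m, m, c, h);  // m == x^(2^k) mod C
        if ( (d % k == 0) && is_prime(d / k) )  // k == d/p for a prime p
            if ( 1 != bitpol_gcd(m ^ 2UL, c) )  return false;
    }
    m = bitpolmod_mult(m, m, c, h);  // m == x^(2^d) mod C
    return ( m == 2 );
}
```

Because k runs upwards, the exponents d/p_i are met in increasing order, which corresponds to processing the prime divisors in decreasing order.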


40.4.1.3 Testing for irreducibility without GCD computations 


Call a binary polynomial C of degree d that has no linear factors and for which

x^{2^d} \equiv x \bmod C    (40.4-3)

and, for all k < d,

x^{2^k} \not\equiv x \bmod C    (40.4-4)

a strong pseudo irreducible (SPI). The test whether a polynomial is SPI does not involve any GCD
computation. The test for a polynomial C of degree d can be given as
1. If C has a linear factor (x or x + 1), then return false.
2. For k = 1, ..., d compute s_k := x^{2^k} mod C by successive squarings.
3. If s_k = x for any k < d, then return false.
4. If s_d \ne x, then return false.
5. Return true.
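
The five steps translate almost directly into code. The sketch below is a simplified illustration for polynomials that fit into a word (degree less than the word size); it is not the FXT routine, and the helper names are made up.

```cpp
#include <cassert>

typedef unsigned long ulong;

static int bitpol_deg(ulong a)  // degree, -1 for the zero polynomial
{
    int d = -1;
    while ( a )  { ++d;  a >>= 1; }
    return d;
}

static bool parity_odd(ulong a)  // whether the number of set bits is odd
{
    bool p = false;
    while ( a )  { p = !p;  a &= (a - 1); }
    return p;
}

static ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        ulong s = a & h;
        a <<= 1;
        if ( s )  a ^= c;
    }
    return t;
}

// Whether C is a strong pseudo irreducible.  No GCDs are computed.
static bool bitpol_spi(ulong c)
{
    if ( c < 4 )  return ( c >= 2 );      // x and 1+x count as irreducible
    if ( 0 == (c & 1) )  return false;    // step 1: factor x
    if ( !parity_odd(c) )  return false;  // step 1: factor x+1
    const int d = bitpol_deg(c);
    const ulong h = 1UL << (d - 1);
    ulong s = 2;  // x
    for (int k = 1; k < d; ++k)
    {
        s = bitpolmod_mult(s, s, c, h);   // step 2: s_k = x^(2^k) mod C
        if ( s == 2 )  return false;      // step 3
    }
    s = bitpolmod_mult(s, s, c, h);       // s_d
    return ( s == 2 );                    // steps 4 and 5
}

// Number of SPIs of degree d (by brute force).
static unsigned count_spi(int d)
{
    unsigned ct = 0;
    for (ulong c = 1UL << d; c < (2UL << d); ++c)
        if ( bitpol_spi(c) )  ++ct;
    return ct;
}
```

For the degrees 4, 5 and 6 (a prime power, a prime, and a product of two primes) the SPI counts are 3, 6 and 9, which are the numbers of irreducible polynomials of these degrees.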


If d is a prime, the power of a prime, or the product of two primes, then strong pseudo irreducibility
implies irreducibility (see [23]). We list the degrees 1 \le d \le 63 where strong pseudo irreducibility does
not imply irreducibility (and GCDs are needed for irreducibility testing):

    12, 18, 20, 24, 28, 30, 36, 40, 42, 44, 45, 48, 50, 52, 54, 56, 60, 63

The sequence is entry A102467 in [312]. For the degrees 44 = 4 \cdot 11 and 52 = 4 \cdot 13 no GCDs are needed
because (see [23]) strong pseudo irreducibility also implies irreducibility if d = r^k s where r and s are
distinct primes and s > (2^{r^k} - 2)/r. We have r^k = 4, so
we need s > (2^4 - 2)/2 = 7, which holds for primes s \ge 11.


In the implementation of the SPI test an extra branch is needed if the polynomial C does not fit into a
word. In that case the parity must be even [FXT: bpol/bitpol-spi.cc]:

    bool
    bitpol_spi_q(ulong c, ulong h)
    // Return whether C is a strong pseudo irreducible (SPI).
    // A polynomial C of degree d is an SPI if
    //  it has no linear factors, x^(2^k)!=x for 0<k<d, and x^(2^d)==x.
    // h needs to be a mask with one bit set:
    //  h == highest_one(C) >> 1  ==  1UL << (degree(C)-1)
    {
        const bool md = (bool)((h<<1)==0);  // whether degree == BITS_PER_LONG

        if ( md )
        {
            if ( (c&1)==0 )  return false;  // factor x
            if ( 0 != parity(c) )  return false;  // factor x+1
        }
        else
        {
            if ( c<4 )  // C is one of 0, 1, x, 1+x
            {
                if ( c>=2 )  return true;   // x, and 1+x are irreducible
                else         return false;  // constant polynomials are reducible
            }

            if ( (c&1)==0 )  return false;  // factor x
            if ( 0 == parity(c) )  return false;  // factor x+1
        }

        ulong t = 1;
        ulong m = 2;  // x
        m = bitpolmod_square(m, c, h);
        do
        {
            if ( m==2 )  return false;
            m = bitpolmod_square(m, c, h);
            t <<= 1;
        }
        while ( t!=h );

        if ( m!=2 )  return false;

        return true;
    }


An auxiliary function returns whether GCDs are needed with the irreducibility test (64-bit version):

    bool bitpol_need_gcd(ulong h)
    // Return whether GCDs are needed for irreducibility test.
    {
        // degrees where GCDs are needed:
        // 12, 18, 20, 24, 28, 30, 36, 40, 42, 45, 48, 50, 54, 56, 60, 63
        const ulong gn =
            (1UL<<12) | (1UL<<18) | (1UL<<20) | (1UL<<24) | (1UL<<28) | (1UL<<30) |
            (1UL<<36) | (1UL<<40) | (1UL<<42) | (1UL<<45) | (1UL<<48) | (1UL<<50) |
            (1UL<<54) | (1UL<<56) | (1UL<<60) | (1UL<<63) ;
        return  0 != ( h & (gn>>1) );
    }


Now the irreducibility test can be implemented as follows:

    inline bool bitpol_irreducible_q(ulong c, ulong h)
    {
        if ( bitpol_need_gcd(h) )  return bitpol_irreducible_ben_or_q(c, h);
        else                       return bitpol_spi_q(c, h);
    }


As the SPI test also works for polynomials not fitting into a word we can test those for irreducibility. 


40.5 Primitive polynomials

Let C be an irreducible polynomial. Then the sequence p_k = x^k mod C, k = 1, 2, ... is periodic and
the (smallest) period m of the sequence is the order of x modulo C. We call m the period (or order) of
the polynomial C. For a binary polynomial of degree n the maximal period equals 2^n - 1.

For the period m of C we have x^m = 1 mod C, so x^m - 1 = 0 mod C. That is, C divides x^m - 1 but no
polynomial x^k - 1 with k < m.

A polynomial is called primitive if its period is maximal. Then the powers of x generate all nonzero
binary polynomials of degree \le n - 1. The polynomial x is a generator (‘primitive root’) modulo C.
Primitivity implies irreducibility, the converse is not true.

The situation is somewhat parallel to the operations modulo an integer:

• Among those integers m that are prime some have the primitive root 2: the sequence 2^k for
k = 1, 2, ..., m - 1 contains all nonzero numbers modulo m (see chapter 26 on page 535).

• Among those polynomials C that are irreducible some are primitive: the sequence x^k for k =
1, 2, ..., 2^n - 1 contains all nonzero polynomials modulo C.


Note that there is another notion of the term ‘primitive’, that of a polynomial for which the greatest 
common divisor of all coefficients is one. 
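
For tiny degrees the period can be found directly from the definition by repeated multiplication with x. The sketch below does exactly that; it is meant as an illustration only (the cost grows like 2^n), assumes a degree of at least 2 and a polynomial that fits into a word, and uses hypothetical function names.

```cpp
#include <cassert>

typedef unsigned long ulong;

static ulong bitpolmod_times_x(ulong a, ulong c, ulong h)
{
    ulong s = a & h;
    a <<= 1;
    if ( s )  a ^= c;
    return a;
}

// Period of C: smallest k >= 1 with x^k == 1 mod C.
// h == 1UL << (degree(C)-1), degree(C) >= 2.
// Returns 0 if no such k <= 2^n - 1 exists (x not invertible modulo C).
static ulong bitpol_period_naive(ulong c, ulong h)
{
    const ulong max = (h << 1) - 1;  // 2^n - 1
    ulong p = 2;  // x^1
    for (ulong k = 1; k <= max; ++k)
    {
        if ( p == 1 )  return k;     // here p == x^k mod C
        p = bitpolmod_times_x(p, c, h);
    }
    return 0;
}

// Whether the period of C is maximal.
static bool bitpol_primitive_naive(ulong c, ulong h)
{
    return bitpol_period_naive(c, h) == (h << 1) - 1;
}
```

For example, x^4 + x + 1 has period 15 and is primitive, while x^4 + x^3 + x^2 + x + 1 is irreducible but has period 5 only, because it divides x^5 - 1.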


40.5.1 Roots of primitive polynomials have maximal order 


A different characterization of primitivity is as follows. Suppose you want to do computations with
linear combinations A = \sum_{k=0}^{n-1} a_k \alpha^k (where a_k \in GF(2)) of the powers of an (unknown!) root \alpha of an
irreducible polynomial C = x^n + \sum_{k=0}^{n-1} c_k x^k.

When multiplying A with the root \alpha we get a term \alpha^n which we want to get rid of. But we have

\alpha^n = -\sum_{k=0}^{n-1} c_k \alpha^k    (40.5-1)

as \alpha is a root of the polynomial C. Therefore we can use exactly the same modular reduction as with
polynomial computation modulo C. The same is true for the multiplication of two linear combinations
(of the powers of the same root \alpha).

We see that the order of a polynomial p is the order of its root \alpha modulo p and that a polynomial is
primitive if and only if its root has maximal order. An irreducible polynomial C of degree n has n distinct
roots, they are equal to x^{2^k} mod C for 0 \le k < n. The orders of all roots are identical.


40.5.2 Testing for primitivity 


Checking a degree-d binary polynomial for primitivity by directly using the definition has complexity
O(2^d) which is prohibitive except for tiny d. A much better solution is a modification of the algorithm
to determine the order in a finite field given in section 39.7.1.2 on page 779. The implementation given
here uses the GP language:

    polorder(p) =
    /* Order of x modulo p (p irreducible over GF(2)) */
    {
        local(g, g1, te, tp, tf, tx);
        g = x;
        p *= Mod(1,2);
        te = nn_;
        for(i=1, np_,
            tf = vf_[i];  tp = vp_[i];  tx = vx_[i];
            te = te / tf;
            g1 = Mod(g, p)^te;
            while ( 1!=g1,
                g1 = g1^tp;
                te = te * tp;
            );
        );
        return( te );
    }

The function uses the following global variables that must be set up before call:

    nn_ = 0;  /* max order = 2^n-1 */
    np_ = 0;  /* number of primes in factorization */
    vp_ = [];  /* vector of primes */
    vf_ = [];  /* vector of factors (prime powers) */
    vx_ = [];  /* vector of exponents */

As given, the algorithm will do n_p exponentiations modulo p where n_p is the number of different primes in
the factorization of m. A C++ implementation of the algorithm is given in [FXT: bpol/bitpol-order.cc].
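
The same idea can be sketched in C++ for polynomials that fit into a word. This is not the FXT implementation: the factorization of 2^n - 1 is found by trial division here (FXT and the GP code rely on precomputed factorizations), the names are made up, and C is assumed to be irreducible of degree n with 2 \le n less than the word size.

```cpp
#include <cassert>

typedef unsigned long ulong;

static ulong bitpolmod_mult(ulong a, ulong b, ulong c, ulong h)
{
    ulong t = 0;
    while ( b )
    {
        if ( b & 1 )  t ^= a;
        b >>= 1;
        ulong s = a & h;
        a <<= 1;
        if ( s )  a ^= c;
    }
    return t;
}

static ulong bitpolmod_power(ulong a, ulong e, ulong c, ulong h)
{
    ulong r = 1;
    while ( e )
    {
        if ( e & 1 )  r = bitpolmod_mult(r, a, c, h);
        a = bitpolmod_mult(a, a, c, h);
        e >>= 1;
    }
    return r;
}

// Order of x modulo C (C irreducible),  h == 1UL << (degree(C)-1).
static ulong bitpol_order(ulong c, ulong h)
{
    const ulong m = (h << 1) - 1;  // maximal order 2^n - 1 (odd)
    ulong te = m;                  // candidate for the order
    ulong r = m;                   // part of m not yet factored
    for (ulong p = 3; r > 1; p += 2)
    {
        if ( p * p > r )  p = r;        // remaining part is prime
        if ( r % p != 0 )  continue;
        ulong pe = 1;                   // full power of p dividing m
        while ( r % p == 0 )  { r /= p;  pe *= p; }
        te /= pe;
        ulong g1 = bitpolmod_power(2, te, c, h);  // x^te
        while ( g1 != 1 )               // put back factors p as needed
        {
            g1 = bitpolmod_power(g1, p, c, h);
            te *= p;
        }
    }
    return te;
}
```

For x^6 + x^3 + 1 (c = 0x49, h = 0x20) the maximal order is 63 = 9 * 7 and the routine returns 9, as this polynomial divides x^9 - 1.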


A shortcut that makes the algorithm terminate as soon as the computed order drops below maximum is

    polmaxorder_q(p) =
    /* Whether order of x modulo p is maximal (p irreducible over GF(2)) */
    /* Early-out variant */
    {
        local(g1, te, tp, tf, tx, ct);
        p *= Mod(1,2);
        te = nn_;
        for(i=1, np_,
            tf = vf_[i];  tp = vp_[i];  tx = vx_[i];
            te = te / tf;
            g1 = Mod(x, p)^te;
            ct = 0;
            while ( 1!=g1,
                g1 = g1^tp;
                te = te * tp;
                ct = ct + 1;
            );
            if ( ct<tx,  return(0) );
        );
        return(1);
    }


With polmaxorder_q() and GP's built-in polisirreducible() the search for the lexicographically minimal
primitive polynomials up to degree n = 100 is a matter of about ten seconds. Extending the list
up to n = 200 takes three minutes. The computation of all polynomials up to degree n = 400 takes less
than an hour.


Again, the algorithm depends on precomputed factorizations. The table [FXT: data/mersenne-factors.txt],
taken from [89], was used to save computation time.


For prime $m = 2^n - 1$ (that is, m is a Mersenne prime) irreducibility suffices for primitivity. The one-liner

n=607; for(k=1,n-1,if(polisirreducible(Mod(1,2)*(1+t^k+t^n)),print1(" ",k)))

finds all primitive trinomials $x^n + x^k + 1$ whose degree is the Mersenne exponent n = 607. The computation
of the following list takes about two minutes.


 89:  38 51
127:  1 7 15 30 63 64 97 112 120 126
521:  32 48 158 168 353 363 473 489
607:  105 147 273 334 460 502
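For Mersenne exponents small enough for a machine word, the same search can be sketched in C++. The sketch relies on the following observation (an assumption spelled out here, not taken from the text): for prime n a trinomial $x^n + x^k + 1$ has no root in GF(2), so it is irreducible exactly if $x^{2^n} \equiv x$ modulo the trinomial. The function names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

typedef std::uint64_t u64;

// a*b modulo c over GF(2); c has degree n (2 <= n <= 63), deg(a), deg(b) < n.
u64 trinomial_mult_mod(u64 a, u64 b, u64 c, unsigned n)
{
    u64 r = 0;
    while ( b )
    {
        if ( b & 1 )  r ^= a;
        b >>= 1;
        a <<= 1;
        if ( (a >> n) & 1 )  a ^= c;
    }
    return r;
}

// Whether x^n + x^k + 1 is irreducible; valid only for prime n and 0 < k < n.
bool trinomial_irreducible_prime_degree(unsigned n, unsigned k)
{
    const u64 c = (u64(1) << n) | (u64(1) << k) | 1;
    u64 s = 2;  // the polynomial 'x'
    for (unsigned j = 0; j < n; ++j)  s = trinomial_mult_mod(s, s, c, n);  // s = x^(2^(j+1))
    return s == 2;
}

// All k such that x^n + x^k + 1 is irreducible (n prime, n <= 63).
std::vector<unsigned> irreducible_trinomial_middle_terms(unsigned n)
{
    std::vector<unsigned> ks;
    for (unsigned k = 1; k < n; ++k)
        if ( trinomial_irreducible_prime_degree(n, k) )  ks.push_back(k);
    return ks;
}
```

If n is a Mersenne exponent, every k returned gives a primitive trinomial. For prime n that is not a Mersenne exponent the result only lists the irreducible trinomials.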


Note we did not exploit the symmetry (reversed polynomials are also primitive). Techniques to find 
primitive trinomials whose degrees are very big Mersenne exponents are described in and [84]. 


Here is a surprising theorem: Let $p(x) = \sum_{k=0}^{d} c_k\, x^k$ be an irreducible binary polynomial and
$L_p(x) := \sum_{k=0}^{d} c_k\, x^{2^k}$. Then all irreducible factors of $L_p(x)/x$ (a polynomial of degree $2^d - 1$)
are of degree equal to ord(p) (the order of x modulo p(x)). In particular, if p(x) is primitive, then $L_p(x)/x$ is
irreducible. The theorem is proved in [368] and also in [233, p.110]. An example: $x^7 + x + 1$ is primitive, so
$x^{127} + x + 1$ is irreducible. But, as $2^{127} - 1$ is prime, $x^{127} + x + 1$ is also primitive. Therefore
$x^{2^{127}-1} + x + 1$ is irreducible.


40.6 The number of irreducible and primitive polynomials 


  n:  I_n       n:  I_n        n:  I_n          n:  I_n
  1:  2        11:  186       21:  99858       31:  69273666
  2:  1        12:  335       22:  190557      32:  134215680
  3:  2        13:  630       23:  364722      33:  260300986
  4:  3        14:  1161      24:  698870      34:  505286415
  5:  6        15:  2182      25:  1342176     35:  981706806
  6:  9        16:  4080      26:  2580795     36:  1908866960
  7:  18       17:  7710      27:  4971008     37:  3714566310
  8:  30       18:  14532     28:  9586395     38:  7233615333
  9:  56       19:  27594     29:  18512790    39:  14096302710
 10:  99       20:  52377     30:  35790267    40:  27487764474

Figure 40.6-A: The number of irreducible binary polynomials for degrees n ≤ 40.


  n:  P_n       n:  P_n        n:  P_n          n:  P_n
  1:  1        11:  176       21:  84672       31:  69273666
  2:  1        12:  144       22:  120032      32:  67108864
  3:  2        13:  630       23:  356960      33:  211016256
  4:  2        14:  756       24:  276480      34:  336849900
  5:  6        15:  1800      25:  1296000     35:  929275200
  6:  6        16:  2048      26:  1719900     36:  725594112
  7:  18       17:  7710      27:  4202496     37:  3697909056
  8:  16       18:  7776      28:  4741632     38:  4822382628
  9:  48       19:  27594     29:  18407808    39:  11928047040
 10:  60       20:  24000     30:  17820000    40:  11842560000

Figure 40.6-B: The number of primitive binary polynomials for degrees n ≤ 40.


The number of irreducible binary polynomials of degree n is

$$ I_n = \frac{1}{n} \sum_{d \mid n} \mu(d)\, 2^{n/d} = \frac{1}{n} \sum_{d \mid n} \mu(n/d)\, 2^{d} $$   (40.6-1)

Here μ denotes the Möbius function. The expression is identical to the
formula for the number of Lyndon words (relation 18.3-2 on page 380). If n is prime, then $I_n = (2^n - 2)/n$.
Figure 40.6-A gives $I_n$ for n ≤ 40, the sequence is an entry in [312]. The list of all irreducible
polynomials up to degree 11 is given in [FXT: data/all-irredpoly.txt].
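Relation 40.6-1 is easy to evaluate directly. The following C++ sketch (hypothetical function names, Möbius function computed by trial division) reproduces the entries of figure 40.6-A for degrees up to 62.

```cpp
#include <cstdint>

// Moebius function mu(n) for n >= 1, by trial division.
int moebius_sketch(std::uint64_t n)
{
    int mu = 1;
    for (std::uint64_t p = 2; p*p <= n; ++p)
    {
        if ( n % p == 0 )
        {
            n /= p;
            if ( n % p == 0 )  return 0;  // square factor
            mu = -mu;
        }
    }
    if ( n > 1 )  mu = -mu;  // one remaining prime factor
    return mu;
}

// Number of irreducible binary polynomials of degree n (1 <= n <= 62):
// I_n = 1/n * sum_{d|n} mu(d) * 2^(n/d)
std::uint64_t num_irred_sketch(unsigned n)
{
    std::int64_t s = 0;
    for (unsigned d = 1; d <= n; ++d)
        if ( n % d == 0 )  s += moebius_sketch(d) * (std::int64_t(1) << (n/d));
    return std::uint64_t(s) / n;
}
```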
For large degrees n the probability that a randomly chosen polynomial is irreducible is about 1/n. With 
polynomials in two or more variables the situation is very different: the probability that a random 
polynomial is irreducible tends to 1 for large n, see [58]. 


The number of primitive binary polynomials of degree n equals

$$ P_n = \frac{\varphi(2^n - 1)}{n} $$   (40.6-2)

where φ is Euler's totient function. If n is the exponent of a Mersenne prime we have $P_n = (2^n - 2)/n = I_n$.
The values of $P_n$ for n ≤ 40 are shown in figure 40.6-B. The sequence is entry A011260 in [312]. The list of
all primitive polynomials up to degree 11 is given in [FXT: data/all-primpoly.txt].
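Relation 40.6-2 can be checked against figure 40.6-B with a short C++ sketch. The function names are hypothetical and the totient is computed by trial division, which is adequate for the degrees n ≤ 40 of the table.

```cpp
#include <cstdint>

// Euler's totient function by trial division.
std::uint64_t euler_phi_sketch(std::uint64_t m)
{
    std::uint64_t r = m;
    for (std::uint64_t p = 2; p*p <= m; ++p)
    {
        if ( m % p == 0 )
        {
            while ( m % p == 0 )  m /= p;
            r -= r / p;
        }
    }
    if ( m > 1 )  r -= r / m;  // one remaining prime factor
    return r;
}

// Number of primitive binary polynomials of degree n (1 <= n <= 63):
// P_n = phi(2^n-1) / n
std::uint64_t num_primitive_sketch(unsigned n)
{
    return euler_phi_sketch((std::uint64_t(1) << n) - 1) / n;
}
```

The difference to the number of irreducible polynomials of the same degree gives the entries of figure 40.6-C.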
  n:  D_n       n:  D_n        n:  D_n          n:  D_n
  1:  1        11:  10        21:  15186       31:  0
  2:  0        12:  191       22:  70525       32:  67106816
  3:  0        13:  0         23:  7762        33:  49284730
  4:  1        14:  405       24:  422390      34:  168436515
  5:  0        15:  382       25:  46176       35:  52431606
  6:  3        16:  2032      26:  860895      36:  1183272848
  7:  0        17:  0         27:  768512      37:  16657254
  8:  14       18:  6756      28:  4844763     38:  2411232705
  9:  8        19:  0         29:  104982      39:  2168255670
 10:  39       20:  28377     30:  17970267    40:  15645204474

Figure 40.6-C: The number of irreducible non-primitive binary polynomials for degrees n ≤ 40.


  n:  P_n/I_n       n:  P_n/I_n       n:  P_n/I_n        n:  P_n/I_n
  1:  0.50000000   26:  0.66642256   51:  0.84834222    76:  0.52983738
  2:  1.0          27:  0.84540117   52:  0.51936149    77:  0.93832726
  3:  1.0          28:  0.49462097   53:  0.99982834    78:  0.56391518
  4:  0.66666667   29:  0.99432922   54:  0.53392943    79:  0.99962783
  5:  1.0          30:  0.49790073   55:  0.91393553    80:  0.42915344
  6:  0.66666667   31:  1.0          56:  0.46549716    81:  0.84506003
  7:  1.0          32:  0.50000763   57:  0.85711404    82:  0.65858526
  8:  0.53333333   33:  0.81066253   58:  0.65165057    83:  0.99401198
  9:  0.85714286   34:  0.66665141   59:  0.99999444    84:  0.38979140
 10:  0.60606061   35:  0.94659138   60:  0.35255399    85:  0.96773455
 11:  0.94623656   36:  0.38011770   61:  1.0           86:  0.66505112
 12:  0.42985075   37:  0.99551569   62:  0.66666667    87:  0.85207814
 13:  1.0          38:  0.66666285   63:  0.83624531    88:  0.47128978
 14:  0.65116279   39:  0.84618267   64:  0.49921989    89:  1.0
 15:  0.82493126   40:  0.43083023   65:  0.96762379    90:  0.46446197
 16:  0.50196078   41:  0.99992518   66:  0.53157031    91:  0.99091593
 17:  1.0          42:  0.55199996   67:  0.99999999    92:  0.51925414
 18:  0.53509496   43:  0.99757669   68:  0.52884860    93:  0.85714286
 19:  1.0          44:  0.50216809   69:  0.83890107    94:  0.66388120
 20:  0.45821639   45:  0.81138931   70:  0.55834947    95:  0.96267339
 21:  0.84792405   46:  0.65247846   71:  0.99999560    96:  0.38730483
 22:  0.62990076   47:  0.99935309   72:  0.35544000    97:  0.99991264
 23:  0.97871804   48:  0.38932803   73:  0.99772166    98:  0.64603552
 24:  0.39561006   49:  0.99212598   74:  0.66330362    99:  0.79553432
 25:  0.96559617   50:  0.58273388   75:  0.82216371   100:  0.45025627

Figure 40.6-D: Ratios $P_n/I_n$ for degrees n ≤ 100: a random irreducible binary polynomial of prime
degree n is likely primitive even if n is not a Mersenne exponent.
The difference $D_n := I_n - P_n$ is the number of irreducible non-primitive polynomials (see figure 40.6-C).
If n is the exponent of a Mersenne prime, we have $D_n = 0$. The complete list of these polynomials up to
degree 12 inclusive is given in [FXT: data/all-nonprim-irredpoly.txt].


Figure 40.6-D gives the probability that a randomly chosen irreducible polynomial of degree n is primitive.
A polynomial of prime degree is very likely primitive, so any conjecture suggesting that polynomials of 
a certain type are always primitive for prime degree is dubious: if we take one random irreducible 
polynomial for each prime degree n, then chances are that all of them are primitive. 


40.7 Transformations that preserve irreducibility


40.7.1 The reciprocal polynomial 
The reciprocal of a polynomial F(x) is the polynomial 
$$ F^{*}(x) = x^{\deg F}\, F(1/x) $$   (40.7-1)


The roots of F*(x) are the inverses of the roots of F(x). The reciprocal of a binary polynomial is the 
reversed binary word: 


inline ulong bitpol_recip(ulong c)
// Return x^deg(C) * C(1/x) (the reciprocal polynomial)
{
    ulong t = 0;
    while ( c )
    {
        t <<= 1;
        t |= (c & 1);
        c >>= 1;
    }
    return t;
}


Alternatively, we can use the bit-reversal routines given in section 1.14 on page 33. The reciprocal
of an irreducible polynomial is again irreducible. The order of the polynomial is preserved under the
transformation.


40.7.2 The polynomial p(x + 1) 


If a polynomial p(x) is irreducible, then p(x + 1), the composition with x + 1, is also irreducible. The
composition with x + 1 does not in general preserve the order: the simplest example is the primitive polynomial
$p(x) = x^4 + x^3 + 1$ where $p(x + 1) = x^4 + x^3 + x^2 + x + 1$ has the order 5. The order of x modulo p(x)
equals the order of x + 1 modulo p(x + 1). The composition with x + 1 can be computed by [FXT:
bpol/bitpol-irred.h]:


inline ulong bitpol_compose_xp1(ulong c)
// Return C(x+1).
// Self-inverse.
{
    ulong z = 1;
    ulong r = 0;
    while ( c )
    {
        if ( c & 1 )  r ^= z;
        c >>= 1;
        z ^= (z<<1);
    }
    return r;
}


A faster routine that finishes in time $\log_2(b)$ (where b = bits per word) is the blue_code() from
section 1.19 on page 49.


In general the sequence of successive ‘compose’ and ‘reverse’ operations leads to six different polynomials: 


C = [11, 10, 4, 3, 0]
[11, 10, 4, 3, 0]                 -- recip   (C=bitpol_recip(C))       -->
[11, 8, 7, 1, 0]                  -- compose (C=bitpol_compose_xp1(C)) -->
[11, 10, 9, 7, 6, 5, 4, 1, 0]     -- recip   -->
[11, 10, 7, 6, 5, 4, 2, 1, 0]     -- compose -->
[11, 9, 7, 2, 0]                  -- recip   -->
[11, 9, 4, 2, 0]                  -- compose -->
[11, 10, 4, 3, 0]                 == initial value
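The cycle of six polynomials can be reproduced with a small C++ sketch. The two transformations follow the routines shown above; the helper `pol_from_exponents()` and the function `six_steps()` are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

typedef std::uint64_t u64;

// Reciprocal polynomial: x^deg(C) * C(1/x).
u64 recip_sketch(u64 c)
{
    u64 t = 0;
    while ( c )  { t <<= 1;  t |= (c & 1);  c >>= 1; }
    return t;
}

// Composition C(x+1), self-inverse.
u64 compose_xp1_sketch(u64 c)
{
    u64 z = 1, r = 0;        // z == (x+1)^j
    while ( c )
    {
        if ( c & 1 )  r ^= z;
        c >>= 1;
        z ^= (z << 1);       // z *= (x+1)
    }
    return r;
}

// Binary polynomial with the given exponents, e.g. {4, 1, 0} for x^4+x+1.
u64 pol_from_exponents(std::initializer_list<unsigned> e)
{
    u64 c = 0;
    for (unsigned k : e)  c |= (u64(1) << k);
    return c;
}

// Apply recip, compose, recip, compose, recip, compose;
// return the start value followed by the six results.
std::vector<u64> six_steps(u64 c)
{
    std::vector<u64> v(1, c);
    for (int j = 0; j < 6; ++j)
        v.push_back( (j & 1) ? compose_xp1_sketch(v.back()) : recip_sketch(v.back()) );
    return v;
}
```

The last entry of the returned vector equals the first one, in agreement with the listing.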
Figure 40.8-A: All irreducible self-reciprocal binary polynomials of degree 18 (right) and the
corresponding irreducible polynomials of degree 9 with nonzero linear coefficient (left).

A polynomial is called self-reciprocal if it is its own reciprocal. The irreducible self-reciprocal polynomials
(SRPs), except for 1 + x, are of even degree 2d. They can be computed from the irreducible polynomials
of degree d with nonzero linear coefficient. Let $F(x) = \sum_{j=0}^{d} f_j\, x^j$ and $S_F(x)$ the corresponding SRP,
then

$$ S_F(x) = x^d\, F(x + 1/x) = \sum_{j=0}^{d} f_j\, x^{d-j}\, (1 + x^2)^j $$   (40.8-1)

The irreducible SRPs of degree 18 and their corresponding polynomials are shown in figure 40.8-A [FXT:
gf2n/bitpol-srp-demo.cc]. The conversion can be implemented as [FXT: bpol/bitpol-srp.h]:

inline ulong bitpol_pol2srp(ulong f, ulong d)
// Return the self-reciprocal polynomial S=x^d*F(x+1/x) where d=deg(f).
// S = sum(j=0, d,  F(j)*x^(d-j)*(1+x^2)^j ) where
//   F(j) is the j-th coefficient of F.
// Must have: d==degree(F)
{
    ulong w = 1;  // == (x^2+1)^j
    ulong s = 0;
    do  // for j=0...d:
    {
        if ( f & 1 )  s ^= (w << d);  // S += F(j)*x^(d-j)*(1+x^2)^j
        w ^= (w<<2);  // w *= (1+x^2)
        f >>= 1;  // next coefficient to low end
    }
    while ( d-- );
    return s;
}
The inverse function is given in [244]:
1. Set F := 0 and j := 0.
2. If $S \bmod (x^2 + 1) = 0$, then set $f_j := 0$, else set $f_j := 1$.
3. Set $S := (S - f_j\, x^{d-j}) / (x^2 + 1)$ [the division is exact].
4. Set j := j + 1. If j ≤ d goto step 2.
5. Return $F = \sum_{j=0}^{d} f_j\, x^j$.

The computation of $S \bmod (x^2 + 1)$ can be omitted because the quantity is 0 if and only if the central
coefficient of S equals 0. The assignment $S := (S - f_j\, x^{d-j})/(x^2 + 1)$ can be replaced by $S := S/(x^2 + 1)$
(as power series) because no coefficient beyond the position d - j is needed by the following steps. We
use the power series division shown in section 40.1.6 on page 826 for this computation:

inline ulong bitpol_srp2pol(ulong s, ulong hd)
// Inverse of bitpol_pol2srp().
// Must have: hd = degree(s)/2 (note: _half_ of the degree).
// Only the lower half coefficients are accessed, i.e.
//   the routine works for degree(S) <= 2*BITS_PER_LONG-2.
{
    ulong f = 0;
    ulong mh = 1UL << hd;
    ulong ml = 1;
    do
    {
        ulong b = s & mh;  // central coefficient
//        s ^= b;  // set central coefficient to zero (not needed)
        if ( b )  f |= ml;  // positions 0,1,...,hd
        ml <<= 1;
        s = bitpol_div_x2p1(s);  // exact division by (x^2+1)
    }
    while ( (mh>>=1) );

    return f;
}
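The two conversions can be tried out with the following self-contained C++ sketch. It mirrors the routines above for polynomials of degree d ≤ 31; the names carry a `_sketch` suffix because they are illustrations, and the power series division by $x^2 + 1$ is written out here as an assumption about what the division routine has to deliver, namely multiplication by $1 + x^2 + x^4 + \ldots$ truncated to the word length.

```cpp
#include <cstdint>

typedef std::uint64_t u64;

// S = x^d * F(x+1/x) = sum_j F(j) * x^(d-j) * (1+x^2)^j,  d = deg(F) <= 31.
u64 pol2srp_sketch(u64 f, unsigned d)
{
    u64 w = 1, s = 0;                      // w == (1+x^2)^j
    for (unsigned j = 0; j <= d; ++j)
    {
        if ( f & 1 )  s ^= (w << (d - j));
        w ^= (w << 2);                     // w *= (1+x^2)
        f >>= 1;
    }
    return s;
}

// Power series division by (1+x^2), truncated to 64 coefficients.
u64 div_x2p1_sketch(u64 s)
{
    s ^= s << 2;  s ^= s << 4;  s ^= s << 8;  s ^= s << 16;  s ^= s << 32;
    return s;
}

// Inverse of pol2srp_sketch(); hd is half of the degree of s.
u64 srp2pol_sketch(u64 s, unsigned hd)
{
    u64 f = 0, ml = 1;
    u64 mh = u64(1) << hd;
    do
    {
        if ( s & mh )  f |= ml;            // central coefficient gives next f_j
        ml <<= 1;
        s = div_x2p1_sketch(s);
    }
    while ( (mh >>= 1) );
    return f;
}

// Check, for all polynomials f of degree d, that the result is self-reciprocal
// (a palindromic word of 2d+1 bits) and that the round trip returns f.
bool srp_roundtrip_ok(unsigned d)
{
    for (u64 f = u64(1) << d; f < (u64(1) << (d + 1)); ++f)
    {
        const u64 s = pol2srp_sketch(f, d);
        u64 t = 0, c = s;
        while ( c )  { t = (t << 1) | (c & 1);  c >>= 1; }   // bit-reverse s
        if ( t != s )  return false;
        if ( srp2pol_sketch(s, d) != f )  return false;
    }
    return true;
}
```

For example, F = x + 1 gives $x^2 + x + 1$ and F = $x^2 + x + 1$ gives the all-ones polynomial of degree 4.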


The self-reciprocal polynomials of degree 2n are factors of the polynomial $x^{2^n+1} - 1$ (see [251]). For
example, for n = 5 we find

? lift(factormod(x^(2^5+1)-1,2))
[x + 1 1]
[x^2 + x + 1 1]
[x^10 + x^7 + x^5 + x^3 + 1 1]
[x^10 + x^9 + x^5 + x + 1 1]
[x^10 + x^9 + x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x + 1 1]

The order of a self-reciprocal polynomial of degree 2n is a divisor of $2^n + 1$. The list of all irreducible
SRPs up to degree 22 is given in [FXT: data/all-irred-srp.txt].


  n:  S_n       n:  S_n        n:  S_n          n:  S_n
  1:  1        11:  93        21:  49929       31:  34636833
  2:  1        12:  170       22:  95325       32:  67108864
  3:  1        13:  315       23:  182361      33:  130150493
  4:  2        14:  585       24:  349520      34:  252645135
  5:  3        15:  1091      25:  671088      35:  490853403
  6:  5        16:  2048      26:  1290555     36:  954437120
  7:  9        17:  3855      27:  2485504     37:  1857283155
  8:  16       18:  7280      28:  4793490     38:  3616814565
  9:  28       19:  13797     29:  9256395     39:  7048151355
 10:  51       20:  26214     30:  17895679    40:  13743895344

Figure 40.8-B: Number of irreducible self-reciprocal polynomials of degree 2n.
  n:  T_n       n:  T_n        n:  T_n          n:  T_n
  1:  1        11:  62        21:  32508       31:  23091222
  2:  1        12:  160       22:  76032       32:  67004160
  3:  1        13:  210       23:  121574      33:  85342752
  4:  2        14:  448       24:  344064      34:  200422656
  5:  2        15:  660       25:  405000      35:  289531200
  6:  4        16:  2048      26:  1005888     36:  892477440
  7:  6        17:  2570      27:  1569780     37:  1237491936
  8:  16       18:  5184      28:  4511520     38:  2874507264
  9:  18       19:  9198      29:  6066336     39:  4697046900
 10:  40       20:  24672     30:  12672000    40:  13690417152

Figure 40.8-C: Number of primitive self-reciprocal polynomials of degree 2n.


The number $S_n$ of irreducible SRPs of degree 2n is

$$ S_n = \frac{1}{2n} \sum_{d \mid n,\; d \text{ odd}} \mu(d)\, 2^{n/d} $$   (40.8-2)

Values of $S_n$ for n ≤ 40 are shown in figure 40.8-B. The sequence of values $S_n$ is entry A000048 in [312].

The number of irreducible polynomials of degree n with linear coefficient one is also $S_n$.


The number $\tilde{S}_n$ of irreducible SRPs of degree n is

$$ \tilde{S}_n = \frac{1}{n} \sum_{d \mid n,\; d \text{ even},\; n/d \text{ odd}} \mu(n/d)\, 2^{d/2} $$   (40.8-3)

We have $\tilde{S}_n = 0$ for n odd and $\tilde{S}_n = S_{n/2}$ for n even. The number $T_n$ of primitive SRPs of degree 2n is

$$ T_n = \frac{\varphi(2^n + 1)}{2n} $$   (40.8-4)

The sequence of values $T_n$ is entry A069925 in [312], values for n ≤ 40 are shown in figure 40.8-C.
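The counts of figures 40.8-B and 40.8-C follow directly from relations 40.8-2 and 40.8-4. A minimal C++ sketch (hypothetical function names, trial division for the Möbius and totient functions):

```cpp
#include <cstdint>

// Moebius function mu(n) for n >= 1, by trial division.
int srp_moebius(std::uint64_t n)
{
    int mu = 1;
    for (std::uint64_t p = 2; p*p <= n; ++p)
    {
        if ( n % p == 0 )
        {
            n /= p;
            if ( n % p == 0 )  return 0;
            mu = -mu;
        }
    }
    if ( n > 1 )  mu = -mu;
    return mu;
}

// Euler's totient function by trial division.
std::uint64_t srp_euler_phi(std::uint64_t m)
{
    std::uint64_t r = m;
    for (std::uint64_t p = 2; p*p <= m; ++p)
    {
        if ( m % p == 0 )
        {
            while ( m % p == 0 )  m /= p;
            r -= r / p;
        }
    }
    if ( m > 1 )  r -= r / m;
    return r;
}

// S_n: number of irreducible SRPs of degree 2n (1 <= n <= 61).
std::uint64_t num_irred_srp_sketch(unsigned n)
{
    std::int64_t s = 0;
    for (unsigned d = 1; d <= n; d += 2)   // odd divisors only
        if ( n % d == 0 )  s += srp_moebius(d) * (std::int64_t(1) << (n/d));
    return std::uint64_t(s) / (2*n);
}

// T_n: number of primitive SRPs of degree 2n (1 <= n <= 62).
std::uint64_t num_prim_srp_sketch(unsigned n)
{
    return srp_euler_phi((std::uint64_t(1) << n) + 1) / (2*n);
}
```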


40.9 Irreducible and primitive polynomials of special forms ‡


We give lists of irreducible and primitive polynomials of special forms. The abbreviation ‘PP’ is used for 
‘primitive polynomial’ in what follows. The weight of a binary polynomial is the sum of its coefficients. 
Polynomials of low weight allow for cheap modular reduction. 


40.9.1 All irreducible and primitive polynomials for low degrees 
For degrees n ≤ 8 the complete list of irreducible polynomials is shown in figure 40.9-A. The list up to
degree n = 11 is given in [FXT: data/all-irredpoly.txt]. The list of PPs for n ≤ 11 is given in [FXT:
data/all-primpoly.txt]. The list of all irreducible polynomials that are not primitive for n ≤ 12 is given
in [FXT: data/all-nonprim-irredpoly.txt].


40.9.2 All irreducible and primitive trinomials for low degrees 


A trinomial is a polynomial with exactly three nonzero coefficients. The irreducible binary trinomials for
degrees n ≤ 49 are shown in figure 40.9-B (there are no irreducible trinomials for degrees 50 and 51). A
list of all irreducible trinomials up to degree n = 400 is given in [FXT: data/all-trinomial-irredpoly.txt].

A more compact form of the list is given in [FXT: data/all-trinomial-irredpoly-short.txt]:

2: 1
[--snip--]
8,4,3,2,0
8,5,3,1,0 
8,5,3,2,0 
8,6,3,2,0 
8,6,4,3,2,1,0 
8,6,5,1,0 
8,6,5,2,0 
8,6,5,3,0 
8,6,5,4,0 
8,7,2,1,0 
8,7,3,2,0 
8,7,5,3,0 
8,7,6,1,0 
8,7,6,3,2,1,0 
8,7,6,5,2,1,0 
8,7,6,5,4,2,0 
# non-primitive 
8,4,3,1,0 
8,5,4,3,0 
8,5,4,3,2,1,0 
8,6,5,4,2,1,0 
8,6,5,4,3,1,0 
8,7,3,1,0 
8,7,4,3,2,1,0 
8,7,5,1,0 
8,7,5,4,0 
8,7,5,4,3,2,0 
8,7,6,4,2,1,0 
8,7,6,4,3,2,0 
8,7,6,5,4,1,0 
8,7,6,5,4,3,0 


Figure 40.9-A: All binary irreducible polynomials up to degree 8. 
35,33 -42,35
-36, -44, 
36,11 -44,39 
-36,15 -46, 
-36,21 -46,45 
36,25 47,5 
-36,27 47,14 
39, 47,20 
39,8 47,21 
39,14 47,26 
39,25 47,27 
39,31 47,33 
39,35 47,42 
41,3 49,9 
41,20 49,12 
41,21 49,15 
41,38 49,22 
49,27 
49,34 
49,37 
49,40 


Figure 40.9-B: All irreducible trinomials $x^n + x^k + 1$ for degrees n ≤ 49. The format of the entries is
n,k for primitive trinomials and -n,k for non-primitive trinomials.
A line starts with the entry for the degree followed by all possible positions of the middle coefficient.

The corresponding files giving primitive trinomials only are [FXT: data/all-trinomial-primpoly.txt] and
[FXT: data/all-trinomial-primpoly-short.txt]. A list of irreducible trinomials that are not primitive is
[FXT: data/all-trinomial-nonprimpoly.txt].

Values of n such that an irreducible trinomial of degree n exists are given in sequence A073571 in [312].
Values such that at least one primitive trinomial exists are given in entry A073726. The values n, k for
primitive polynomials of the form $(x + 1)^n + (x + 1)^k + 1$ are listed in [FXT: data/all-t1-primpoly.txt].




Polynomials of that form are irreducible whenever $x^n + x^k + 1$ is irreducible. The list is not the same as
for primitive trinomials as the transformation $p(x) \mapsto p(x+1)$ does in general not preserve the order. The
sequence of degrees n such that there is a primitive polynomial $(x + 1)^n + (x + 1)^k + 1$ where 0 < k < n
is entry A136416 in [312].


Regarding trinomials, there is a theorem by Swan (given in [327]): The trinomial $x^n + x^k + 1$ over GF(2)
has an even number of irreducible factors (and therefore is reducible) if

1. n is even, k is odd, n ≠ 2k, and either nk/2 ≡ 0 mod 4 or nk/2 ≡ 1 mod 4,
2. n is odd, k is even and does not divide 2n, and n ≡ ±3 mod 8,
3. n is odd, k is even and does divide 2n, and n ≡ ±1 mod 8,
4. any of the above holds for k replaced by n - k (that is, for the reciprocal trinomial).

The first condition implies that no irreducible trinomial for n a multiple of 8 exists (as n is even, k
must be odd, else the trinomial is a perfect square; and nk/2 ≡ 0 mod 4). Further, if n is a prime with
n ≡ ±3 mod 8, then the trinomial can be irreducible only if k = 2 (or n - k = 2). In the note [106] it is
shown that no irreducible trinomial exists for n a prime such that n ≡ 13 mod 24 or n ≡ 19 mod 24.
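The conditions of the theorem translate into a cheap pre-test that can be applied before any irreducibility test. The following C++ sketch uses hypothetical function names and reads the third condition as "n odd, k even" (the congruence n ≡ ±1 mod 8 can only hold for odd n). A return value of false does not imply irreducibility.

```cpp
// Conditions 1 to 3 of Swan's theorem for x^n + x^k + 1 with 0 < k < n.
bool swan_conditions_1_to_3(unsigned n, unsigned k)
{
    if ( (n % 2 == 0) && (k % 2 == 1) && (n != 2*k) )
    {
        const unsigned r = ((n / 2) * k) % 4;      // nk/2 mod 4
        if ( r == 0 || r == 1 )  return true;      // condition 1
    }
    if ( (n % 2 == 1) && (k % 2 == 0) )
    {
        const unsigned m = n % 8;
        const bool divides = ( (2*n) % k == 0 );
        if ( !divides && (m == 3 || m == 5) )  return true;  // condition 2
        if (  divides && (m == 1 || m == 7) )  return true;  // condition 3
    }
    return false;
}

// Whether Swan's theorem shows that x^n + x^k + 1 is reducible.
bool swan_reducible(unsigned n, unsigned k)
{
    return swan_conditions_1_to_3(n, k) || swan_conditions_1_to_3(n, n - k);  // condition 4
}

// Whether every trinomial of degree n is ruled out, either by Swan's theorem
// or because n and k are both even (the trinomial is then a perfect square).
bool no_irreducible_trinomial_by_swan(unsigned n)
{
    for (unsigned k = 1; k < n; ++k)
    {
        const bool square = (n % 2 == 0) && (k % 2 == 0);
        if ( !square && !swan_reducible(n, k) )  return false;
    }
    return true;
}
```

For n a multiple of 8 the last function returns true, in agreement with the remark above. For the prime n = 13 only k = 2 and k = 11 survive the pre-test.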


For some applications one may want to use reducible trinomials whose period is close to that of a primitive
one. For example, the trinomial

$$ x^{32} + x^{15} + 1 = (x^{11} + x^9 + x^7 + x^2 + 1)\,(x^{21} + x^{19} + x^{15} + x^{13} + x^{12} + x^{10} + x^9 + x^8 + x^7 + x^6 + x^4 + x^2 + 1) $$   (40.9-1)

has the period p = 4,292,868,097 which is very close to $2^{32} - 1$ = 4,294,967,295. Note that the degree
is a multiple of 8, so no irreducible trinomial of that degree exists. See [82], [126], and [107].
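The factorization and the period can be checked with a few lines of C++. The sketch multiplies the two factors with a carry-less product and compares the period with $(2^{11} - 1)(2^{21} - 1)$, which is the period when both factors are primitive, as the stated value suggests. The function names are hypothetical.

```cpp
#include <cstdint>
#include <initializer_list>

typedef std::uint64_t u64;

// Carry-less product of two binary polynomials (degrees must sum to at most 63).
u64 bitpol_mult_sketch(u64 a, u64 b)
{
    u64 r = 0;
    while ( b )
    {
        if ( b & 1 )  r ^= a;
        a <<= 1;
        b >>= 1;
    }
    return r;
}

// Binary polynomial with the given exponents.
u64 bitpol_from_exponents(std::initializer_list<unsigned> e)
{
    u64 c = 0;
    for (unsigned k : e)  c |= (u64(1) << k);
    return c;
}

// Period of the product of two primitive polynomials of coprime degrees n1, n2:
// the least common multiple of 2^n1-1 and 2^n2-1 is their product in that case.
u64 product_period(unsigned n1, unsigned n2)
{
    return ((u64(1) << n1) - 1) * ((u64(1) << n2) - 1);
}
```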


40.9.3 Irreducible trinomials of the form $1 + x^k + x^d$
With each sequence, we give its number as entry in [312].

k=1: The trinomial $p = 1 + x + x^d$ is irreducible for the following 2 ≤ d ≤ 34353 (sequence A002475):
2, 3, 4, 6, 7, 9, 15, 22, 28, 30, 46, 60, 63,
127, 153, 172, 303, 471, 532, 865, 900,
1366, 2380, 3310, 4495, 6321, 7447,
10198, 11425, 21846, 24369, 27286, 28713, 32767, 34353
The trinomials are primitive for the following d ≤ 4400 (sequence A073639):
2, 3, 4, 6, 7, 15, 22, 60, 63, 127, 153, 471, 532, 865, 900, 1366

k=2: $p = 1 + x^2 + x^d$ is irreducible for the following 3 ≤ d ≤ 57341 (sequence A057460):
3, 5, 11, 21, 29, 35, 93, 123, 333, 845, 4125,
10437, 10469, 14211, 20307, 34115, 47283, 50621, 57341
The trinomials are primitive for all d ≤ 845 (sequence A074710).

k=3: $p = 1 + x^3 + x^d$ is irreducible for the following 4 ≤ d ≤ 1000 (sequence A057461):
4, 5, 6, 7, 10, 12, 17, 18, 20, 25, 28, 31, 41, 52, 66,
130, 151, 180, 196, 503, 650, 761, 986
The trinomials are primitive for the following d ≤ 400:
4, 5, 7, 10, 17, 20, 25, 28, 31, 41, 52, 130, 151,

k=4: $p = 1 + x^4 + x^d$ is irreducible for the following 5 ≤ d ≤ 1000 (sequence A057463):
7, 9, 15, 39, 57, 81, 105
The trinomials are primitive for the following d ≤ 400: 7, 9, 15, 39, 81.

k=5: $p = 1 + x^5 + x^d$ is irreducible for the following 6 ≤ d ≤ 1000 (sequence A057474):
6, 9, 12, 14, 17, 20, 23, 44, 47, 63, 84,
129, 236, 278, 279, 297, 300,
647, 726, 737,
The trinomials are primitive for the following d ≤ 400:
6, 9, 17, 23, 47, 63,
129, 236, 278, 279, 297


40.9.4 Irreducible trinomials of the form $1 + x^d + x^{kd}$
The trinomial $p = 1 + x^d + x^{2d}$ is irreducible whenever d is a power of 3, that is, for $x^2 + x + 1$,
$x^6 + x^3 + 1$, $x^{18} + x^9 + 1$, and so on.

The trinomial $p = 1 + x^d + x^{3d}$ is irreducible whenever d is a power of 7, and $p = 1 + x^d + x^{4d}$ is irreducible
whenever $d = 3^i\, 5^j$, $i, j \in \mathbb{N}$. Similar regularities can be observed for other forms, see [55].


40.9.5 Primitive pentanomials 
Figure 40.9-D: Binary primitive polynomials of minimum weight.
1,0           17,3,0             33,6,4,1,0         49,6,5,4,0
2,1,0         18,5,2,1,0         34,7,6,5,2,1,0     50,4,3,2,0
3,1,0         19,5,2,1,0         35,2,0             51,6,3,1,0
4,1,0         20,3,0             36,6,5,4,2,1,0     52,3,0
5,2,0         21,2,0             37,5,4,3,2,1,0     53,6,2,1,0
6,1,0         22,1,0             38,6,5,1,0         54,6,5,4,3,2,0
7,1,0         23,5,0             39,4,0             55,6,2,1,0
8,4,3,2,0     24,4,3,1,0         40,5,4,3,0         56,7,4,2,0
9,4,0         25,3,0             41,3,0             57,5,3,2,0
10,3,0        26,6,2,1,0         42,5,4,3,2,1,0     58,6,5,1,0
11,2,0        27,5,2,1,0         43,6,4,3,0         59,6,5,4,3,1,0
12,6,4,1,0    28,3,0             44,6,5,2,0         60,1,0
13,4,3,1,0    29,2,0             45,4,3,1,0         61,5,2,1,0
14,5,3,1,0    30,6,4,1,0         46,8,5,3,2,1,0     62,6,5,3,0
15,1,0        31,3,0             47,5,0             63,1,0
16,5,3,2,0    32,7,5,3,2,1,0     48,7,5,4,2,1,0     64,4,3,1,0


The data in [FXT: data/minweight-primpoly.txt] lists minimal-weight PPs where in addition the
coefficients are as close to the low end as possible. The first entries are shown in figure 40.9-D. A list of
minimal-weight PPs that fit into a machine word is given in [FXT: bpol/primpoly-minweight.cc].

By choosing those PPs where the highest nonzero coefficient is as low as possible one obtains the list in
[FXT: ]. It starts as shown in figure 40.9-E. The corresponding extract for small
degrees is given in [FXT: bpol/primpoly-lowbit.cc]. The index (position) of the second highest nonzero
coefficient (the subdegree of the polynomial) grows slowly with n and is ≤ 12 for all n ≤ 400. So we can
store the list compactly as an array of 16-bit words.


40.9.7 All primitive low-bit polynomials for certain degrees 


A list of all PPs $x^n + \sum_{j=0}^{k} c_j\, x^j$ for degree n = 256 with the second-highest order k ≤ 15 (and the first
few polynomials for k = 16) is given in [FXT: data/lowbit256-primpoly.txt]. The first few are


256,10,5,2,0 


256,12,7, 
256,12,8,7 


Equivalent tables for degrees DEG= 63, 64, 127, 128, 256, 512, 521, 607, 1000, and 1024, can be found in 
the files data/lowbitDEG-primpoly.txt (where DEG has to be replaced by the number). 


40.9.8 Primitive low-block polynomials 


A low-block polynomial has the special form $x^n + \sum_{j=0}^{k} x^j$. Such PPs exist for 218 degrees n ≤ 400.
These are especially easy to store in an array (saving the index of the second highest nonzero coefficient
in array element n). A complete list of all low-block PPs with degree n ≤ 400 is given in [FXT:
data/all-lowblock-primpoly.txt]. A short form of the list is [FXT: data/all-lowblock-primpoly-short.txt]. Among
the low-block PPs are a few where just one bit (the coefficient after the leading coefficient) is not set.
For n ≤ 400 this is for the following degrees:
3, 5, 7, 13, 15, 23, 37, 47, 85, 127, 183, 365, 383

The PPs listed in [FXT: data/lowblock-primpoly.txt] have the smallest possible block of set bits.


40.9.9 Irreducible all-ones polynomials 


Irreducible polynomials of the form $x^n + x^{n-1} + x^{n-2} + \ldots + x + 1$ (all-ones polynomials) exist whenever
n + 1 is a prime number for which 2 is a primitive root. The list of such primes up to 2000 is shown in
figure 41.7-B on page 878. The all-ones polynomials are irreducible for the following n ≤ 400:

1, 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82, 100, 106, 130, 138, 148,
162, 172, 178, 180, 196, 210, 226, 268, 292, 316, 346, 348, 372, 378, 388

The sequence is entry A071642 in [312].
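The list can be recomputed from the stated criterion alone. The following C++ sketch uses hypothetical function names; the case n = 1 (the polynomial x + 1, where n + 1 = 2) is handled separately because 2 is not invertible modulo 2.

```cpp
#include <vector>

// Primality test by trial division (sufficient for small arguments).
bool is_prime_sketch(unsigned p)
{
    if ( p < 2 )  return false;
    for (unsigned q = 2; q*q <= p; ++q)
        if ( p % q == 0 )  return false;
    return true;
}

// Multiplicative order of 2 modulo the odd number p >= 3.
unsigned order_of_2_mod(unsigned p)
{
    unsigned k = 1;
    unsigned t = 2 % p;
    while ( t != 1 )  { t = (2*t) % p;  ++k; }
    return k;
}

// Degrees n <= nmax such that n+1 is a prime with primitive root 2,
// that is, such that the all-ones polynomial of degree n is irreducible.
std::vector<unsigned> all_ones_irreducible_degrees(unsigned nmax)
{
    std::vector<unsigned> v;
    for (unsigned n = 1; n <= nmax; ++n)
    {
        const unsigned p = n + 1;
        if ( !is_prime_sketch(p) )  continue;
        if ( p == 2 || order_of_2_mod(p) == n )  v.push_back(n);
    }
    return v;
}
```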


With the exception of $x^2 + x + 1$, none of the all-ones polynomials is primitive. In fact, the order of x equals
n + 1, which is immediate when printing the powers of x (example using n + 1 = 5, $p = x^4 + x^3 + x^2 + x + 1$):

k    x^k
0    ...1
1    ..1.
2    .1..
3    1...
4    1111
5    ...1  == x^0


For computations modulo all-ones polynomials it is advisable to use the redundant polynomial $x^{n+1} + 1$
during the calculations:

$$ (1 + x + x^2 + x^3 + \ldots + x^n)\,(1 + x) = 1 + x^{n+1} $$   (40.9-2)

One does all computations modulo the product (with cheap reductions) and only reduces the final result
modulo the all-ones polynomial.
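A minimal C++ sketch of this technique, with hypothetical function names: multiplication modulo $x^{n+1} + 1$ is a cyclic convolution, so multiplying by x is a rotation of an (n+1)-bit word, and the final reduction modulo the all-ones polynomial is a single conditional XOR. The last function compares the result with a direct reduction for all pairs of inputs.

```cpp
#include <cstdint>

typedef std::uint64_t u64;

// a*b modulo x^N + 1 (N = n+1 <= 63): cyclic convolution of N-bit words.
u64 mult_cyclic_sketch(u64 a, u64 b, unsigned N)
{
    const u64 mask = (u64(1) << N) - 1;
    u64 r = 0;
    while ( b )
    {
        if ( b & 1 )  r ^= a;
        b >>= 1;
        a = ((a << 1) | (a >> (N - 1))) & mask;  // a *= x: rotate left by one
    }
    return r;
}

// Final reduction modulo the all-ones polynomial of degree n = N-1:
// x^n == x^(n-1) + ... + x + 1.
u64 reduce_all_ones_sketch(u64 a, unsigned N)
{
    const u64 all_ones = (u64(1) << N) - 1;
    if ( (a >> (N - 1)) & 1 )  a ^= all_ones;
    return a;
}

// Reference: a*b reduced modulo the all-ones polynomial in every step.
u64 mult_all_ones_direct(u64 a, u64 b, unsigned n)
{
    const u64 p = (u64(1) << (n + 1)) - 1;
    u64 r = 0;
    while ( b )
    {
        if ( b & 1 )  r ^= a;
        b >>= 1;
        a <<= 1;
        if ( (a >> n) & 1 )  a ^= p;
    }
    return r;
}

// Whether both methods agree for all a, b of degree < n.
bool redundant_modulus_agrees(unsigned n)
{
    const unsigned N = n + 1;
    for (u64 a = 0; a < (u64(1) << n); ++a)
        for (u64 b = 0; b < (u64(1) << n); ++b)
            if ( reduce_all_ones_sketch(mult_cyclic_sketch(a, b, N), N)
                 != mult_all_ones_direct(a, b, n) )  return false;
    return true;
}
```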


The all-ones polynomials are a special case for the factorization of cyclotomic polynomials, see
section 40.11 on page 857. Irreducible polynomials of high weight are considered in [6] where irreducible
polynomials of the form $(x^{n+1} + 1)/(x + 1) + x^k$ up to degree 340 are given.


40.9.10  Irreducible alternating polynomials 


The ‘alternating’ polynomial $1 + \sum_{k=0}^{d} x^{2k+1} = 1 + x + x^3 + x^5 + \ldots + x^{2d+1}$ can be irreducible only if d
is odd:

d: (irred. poly.)
1: x^3 + x + 1
3: x^7 + x^5 + x^3 + x + 1
5: x^11 + x^9 + x^7 + x^5 + x^3 + x + 1

The list up to d = 1000 (sequence A107220 in [312]) is

1, 3, 5, 7, 9, 13, 23, 27, 31, 37, 63, 69, 117, 119, 173, 219, 223,
247, 307, 363, 383, 495, 695, 987,

It can be computed (within about ten minutes) via

for(d=1,1000, p=(1+sum(t=0,d,x^(2*t+1))); if(polisirreducible(Mod(1,2)*p),print1(d,", ")))

Similar to the all-ones polynomials, a speedup can be achieved by using the redundant modulus

$$ (1 + x + x^3 + x^5 + \ldots + x^{2d+1})\,(1 + x^2) = 1 + x + x^2 + x^{2d+3} $$   (40.9-3)


40.9.11 Primitive polynomials with uniformly distributed coefficients 
Primitive polynomials with (roughly) equally spaced coefficients are given in for degrees from 9 
to 660. Polynomials with weight 5 (pentanomials) are given in [FXT: data/eq-primpoly-w5.txt|, the 
polynomials around degree 500 are 


498 372 247 124 0, 499 380 253 125 0, 500 378 250 127 0,
501 375 255 125 0, 502 370 240 121 0

The polynomials with weight 7 are given in [FXT: data/eq-primpoly-w7.txt], the list for weight 9 is [FXT:
data/eq-primpoly-w9.txt].


40.9.12 Irreducible self-reciprocal polynomials 
A list of all irreducible self-reciprocal polynomials (see section 40.8 on page 846) up to degree 22 is given in
[FXT: data/all-irred-srp.txt]. These polynomials have even degree and none of them (with the exception
of $x^2 + x + 1$) is primitive. The number after the percent sign with each entry in figure 40.9-F equals
$(2^{n/2} + 1)/r$ where r is the order of the polynomial with degree n.
% 1


14,13,12, 


16,15,8,1,0 
16,12,11,8,5,4,0 % 1 
16,13,12,10,8,6,4,3,0 
16,13,8,3,0 % 1 


0 


16,14,13,12,10,8 
16,14,13,12,11, 
16,15,13,11,10, 
16,15,13,12,10, 
16,15,13,9,8,7, 
16,15,14,12,10, 
16,15,14,13,11,1 
16,15,14,13,12,1 
16,15,14,13,9,8, 
16,15,14,8,2,1,0 


9 
8 
9 
3 
8 


Figure 40.9-F: Irreducible self-reciprocal polynomials up to degree 16. 
Figure 40.9-G: A family $f_m$ of irreducible self-reciprocal binary polynomials (top) and matrices $M_m$
whose characteristic polynomials are $f_m$ (bottom).
Given an irreducible polynomial $f_0(x) = \sum_{k=0}^{n} c_k\, x^k$ where $c_1 = 1$ and $c_{n-1} = 1$, an infinite family of
irreducible polynomials $f_m(x)$ of degrees $n\, 2^m$ can be given as follows [113]: for m > 0 set
$f_m(x) = S(f_{m-1}(x))$ where $S(p(x)) = x^d\, p(x + 1/x)$ and d is the degree of p (relation 40.8-1 on page 846). The
polynomials $f_m$ are self-reciprocal for m ≥ 1. A formula for the number of degree-n polynomials suitable
as $f_0$ is given in [260], see sequence A175390 in [312].

Starting with $f_0 = x + 1$ we obtain the polynomials shown in figure 40.9-G. The matrices $M_m$ whose
characteristic polynomials are $f_m$ have a simple structure.

The polynomials $f_m$ can be computed in a different way p.63]: set $a_0 = x$, $b_0 = 1$, $a_{m+1} = a_m\, b_m$,
and $b_{m+1} = a_m^2 + b_m^2 = (a_m + b_m)^2$, then $f_m = a_m + b_m$.
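Both ways of computing the family starting with $f_0 = x + 1$ can be compared in a short C++ sketch. The function names are hypothetical, and the sketch is limited to m ≤ 5 (degree 32) so that all intermediate products fit into a 64-bit word.

```cpp
#include <cstdint>

typedef std::uint64_t u64;

// Carry-less product of binary polynomials (degrees must sum to at most 63).
u64 family_mult(u64 a, u64 b)
{
    u64 r = 0;
    while ( b )
    {
        if ( b & 1 )  r ^= a;
        a <<= 1;
        b >>= 1;
    }
    return r;
}

// S(p) = x^d * p(x+1/x) = sum_j p_j * x^(d-j) * (1+x^2)^j where d = deg(p) <= 31.
u64 family_S(u64 p, unsigned d)
{
    u64 w = 1, s = 0;
    for (unsigned j = 0; j <= d; ++j)
    {
        if ( p & 1 )  s ^= (w << (d - j));
        w ^= (w << 2);
        p >>= 1;
    }
    return s;
}

// f_m by repeated application of S to f_0 = x+1 (m <= 5).
u64 family_by_S(unsigned m)
{
    u64 f = 3;        // x + 1
    unsigned d = 1;   // degree of f
    for (unsigned i = 0; i < m; ++i)  { f = family_S(f, d);  d *= 2; }
    return f;
}

// f_m = a_m + b_m with a_0 = x, b_0 = 1, a_{m+1} = a_m*b_m, b_{m+1} = (a_m+b_m)^2 (m <= 5).
u64 family_by_recurrence(unsigned m)
{
    u64 a = 2, b = 1;
    for (unsigned i = 0; i < m; ++i)
    {
        const u64 t = family_mult(a, b);
        b = family_mult(a ^ b, a ^ b);
        a = t;
    }
    return a ^ b;
}
```

The first members are $x + 1$, $x^2 + x + 1$, $x^4 + x^3 + x^2 + x + 1$, and $x^8 + x^7 + x^6 + x^4 + x^2 + x + 1$.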


40.9.13 Irreducible normal polynomials 


 2,1,0              8,7,2,1,0            -9,8,0
                   -8,7,3,1,0             9,8,4,1,0
 3,2,0              8,7,3,2,0             9,8,4,2,0
                   -8,7,4,3,2,1,0         9,8,4,3,2,1,0
 4,3,0             -8,7,5,1,0             9,8,5,4,0
-4,3,2,1,0          8,7,5,3,0             9,8,5,4,3,1,0
                   -8,7,5,4,0            -9,8,6,3,0
 5,4,2,1,0         -8,7,5,4,3,2,0         9,8,6,3,2,1,0
 5,4,3,1,0          8,7,6,1,0             9,8,6,4,3,1,0
 5,4,3,2,0          8,7,6,3,2,1,0         9,8,6,5,3,1,0
                   -8,7,6,4,2,1,0         9,8,6,5,3,2,0
 6,5,0             -8,7,6,4,3,2,0         9,8,6,5,4,1,0
 6,5,2,1,0          8,7,6,5,2,1,0         9,8,7,2,0
 6,5,4,1,0         -8,7,6,5,4,1,0         9,8,7,3,2,1,0
-6,5,4,2,0          8,7,6,5,4,2,0         9,8,7,5,4,3,0
                   -8,7,6,5,4,3,0         9,8,7,6,2,1,0
 7,6,0                                    9,8,7,6,3,1,0
 7,6,3,1,0                                9,8,7,6,4,2,0
 7,6,4,1,0                                9,8,7,6,4,3,0
 7,6,4,2,0                                9,8,7,6,5,1,0
 7,6,5,2,0                                9,8,7,6,5,4,3,1,0
 7,6,5,3,2,1,0
 7,6,5,4,2,1,0
Figure 40.9-H: All normal irreducible polynomials up to degree n = 9.
2,1,0 9,8,6,5,4,1, 13,12,10,6,0 
-9/8,6,3,0 13,12;10,7,4,3,0 
3,2,0 9,8,6,3,2,1,0 13,12,10,9;8,3,2,1,0 
13,12,10,9,8,6,4,1,0 
10,9,8,6,3,2,0 13,12,10,9,8,7,6,4,3,2,0 
,4,2,1, 10,9;8;5;,4,3,0 
-10,98,5,3,1,0 14,13,12,10,6,3,2,1,0 
-6,5,4,2,0 10,9,8,6,4,3,0 -14713,12,9,7,5,3,2,0 
:5,4,1,0 -14,13,12,9.8,6,5,2,0 
11,10,8,7,6,5,0 14,13;12,10,8,6,5,4,2,1,0 
7,6,4,1,0 11,10,8,5,2,1,0 -14,;13,12,9,5,3,2 1,0 
117107874737270 14,13;12,10,7,5,À,1,0 
-14;13,12;10;,8;,4,2;1,0 
14,13;12,9,8,1,0 


Figure 40.9-I: All normal polynomials whose roots form a self-dual basis up to degree n = 14.


The normal irreducible polynomials are those whose roots are linearly independent (see page 900).
A complete list up to degree n = 13 is given in [FXT: data/all-normalpoly.txt], figure 40.9-H
shows the polynomials up to degree n = 9 (polynomials that are not primitive are marked with a ‘-’).

Normal polynomials must have subdegree n - 1, that is, they are of the form $x^n + x^{n-1} + \ldots$. The
condition is necessary but not sufficient: not all irreducible polynomials of subdegree n - 1 are normal. A
list of primitive normal polynomials $x^n + x^{n-1} + \ldots + x^w + 1$ with w as big as possible is given in [FXT:
data/highbit-normalpoly.txt]. Primitive normal polynomials $x^n + x^{n-1} + x^w + \ldots + 1$ where w is as small
as possible are given in [FXT: data/lowbit-normalprimpoly.txt]. Every irreducible all-ones polynomial is
normal.

The polynomials $f_m$ (see figure 40.9-G) are normal for all m, they are primitive only for m = 0 and m = 1.
All normal polynomials whose roots form a self-dual basis (see section 42.6.4 on page 908) up to degree
n = 19 are given in [FXT: data/all-irred-self-dual.txt]. The list up to degree n = 14 is shown in figure
40.9-I. No such polynomials exist for n a multiple of 4.


40.10 Generating irreducible polynomials from Lyndon words 


It is not a coincidence that the number of length-n Lyndon words (see section 18.3 on page 379) is equal to
the number of degree-n irreducible polynomials. Indeed, [95] gives an algorithm that, given one primitive
polynomial, generates an irreducible polynomial from a Lyndon word: Let b be a Lyndon word, c an
irreducible polynomial of degree n and a an element of maximal order modulo c. Set $e = a^b$ and compute
the polynomial $p_e(x)$ over GF($2^n$), defined as

$$ p_e(x) := (x - e)\,(x - e^2)\,(x - e^4)\,(x - e^8) \cdots (x - e^{2^{n-1}}) $$   (40.10-1)

Then all coefficients of $p_e(x)$ are either zero or one and the polynomial is irreducible over GF(2).


   b   period     e       p                   b   period     e       p
 ...1    4  m   ..1.    11..1  P            ...1    4  m   ..1.    11..1  P
 ..1.    4      .1..    11..1  P            ..11    4  m   1...    11111
 ..11    4  m   1...    11111               .1.1    2  m   1.11    1.1.1  red.
 .1..    4      1..1    11..1  P            .111    4  m   .111    1..11  P
 .1.1    2  m   1.11    1.1.1  red.         1111    1  m   ...1    1...1  red.
 .11.    4      1111    11111
 .111    4  m   .111    1..11  P
 1...    4      111.    11..1  P
 1..1    4      .1.1    11111
 1.1.    2      1.1.    1.1.1  red.
 1.11    4      11.1    1..11  P
 11..    4      ..11    11111
 11.1    4      .11.    1..11  P
 111.    4      11..    1..11  P
 1111    1  m   ...1    1...1  red.

Figure 40.10-A: Polynomials $p_e(x)$ for the powers $e = x^b$ of the primitive element x modulo
$c = x^4 + x^3 + 1$ (left). If only necklaces are used as exponents b, each polynomial is found only once (right).
Irreducible polynomials are obtained for aperiodic necklaces.


An implementation in C++ is given in [FXT: class necklace2bitpol in bpol/necklace2bitpol.h]:

    class necklace2bitpol
    {
    public:
        ulong p_[BITS_PER_LONG+1];  // polynomial over GF(2**n)
        ulong n_;   // degree of c
        ulong c_;   // modulus (irreducible polynomial)
        ulong h_;   // mask used for computation
        ulong a_;   // generator modulo c
        ulong e_;   // a^b
        ulong bp_;  // resulting binary polynomial (assigned in poly())

    public:
        necklace2bitpol(ulong n, ulong c=0, ulong a=0)
            : n_(n), c_(c), a_(a)
        {
            if ( 0==c )  c_ = lowbit_primpoly[n];
            if ( 0==a )  a_ = 2UL;  // 'x'
            h_ = (highest_one(c_) >> 1);
        }

        ~necklace2bitpol()  { ; }

        ulong poly(ulong b)
        {
            const ulong e = bitpolmod_power(a_, b, c_, h_);
            e_ = e;
            const ulong x = 2;  // a root of c
            ulong s = e;
            ulong m = 1;  // minpoly
            for (ulong j=0; j<n_; ++j)
            {
                ulong t = x ^ s;
                m = bitpolmod_mult(m, t, c_, h_);
                s = bitpolmod_square(s, c_, h_);
            }
            bp_ = m ^ c_;
            return bp_;
        }
    };


The computation of the polynomials is a variant of the second algorithm in the section on page 892.
Figure 40.10-A (left) shows all polynomials that are generated with c = x^4 + x^3 + 1 and the generator
a = x. This is the output of [FXT: gf2n/necklace2irred-demo.cc]. The columns are: b and its cyclic
period (symbol 'm' appended if the word is the cyclic minimum), e, and p, where a 'P' indicates that p is
primitive. Observe that cyclic shifts of the same word give identical polynomials p. Further, if the period
is not maximal, then p is reducible. Restricting our attention to the necklaces b we obtain each polynomial
just once (right of figure 40.10-A). The Lyndon words b give all degree-n irreducible polynomials. The
primitive polynomials are exactly those where $\gcd(b, 2^n - 1) = 1$.


    degree    necklaces      search
      24:     276 k/sec    147 k/sec
      35:     124 k/sec     64 k/sec
      45:      73 k/sec     36 k/sec
      63:      38 k/sec     18 k/sec

Figure 40.10-B: Rate of generation of irreducible polynomials via necklaces and with exhaustive search.


To generate all irreducible binary polynomials of fixed degree use [FXT: class all_irredpoly in
bpol/all-irredpoly.h]. The usage is shown in [FXT: gf2n/all-irredpoly-demo.cc]. It turns out that the
generation via exhaustive search [FXT: gf2n/bitpol-search-irred-demo.cc] is not much slower; figure 40.10-B
gives the rates of generation for various degrees and both methods.


40.11 Irreducible and cyclotomic polynomials ‡


The primitive binary polynomials of degree n can be obtained by factoring the cyclotomic polynomial
$Y_N$ (see section 37.1.1 on page 704) over GF(2) where $N = 2^n - 1$. For example, with n = 6,

    ? n=6; N=2^n-1; lift( factormod(polcyclo(N),2) )
    [x^6 + x + 1 1]
    [x^6 + x^4 + x^3 + x + 1 1]
    [x^6 + x^5 + 1 1]
    [x^6 + x^5 + x^2 + x + 1 1]
    [x^6 + x^5 + x^3 + x^2 + 1 1]
    [x^6 + x^5 + x^4 + x + 1 1]


We use a routine (pcfprint(N)) that prints the N-th cyclotomic polynomial and its factors in symbolic
form. With n = 6, $N = 2^n - 1 = 63$ we obtain

    ? n=6; N=2^n-1;
    ? pcfprint(N)
    63: [ 36 33 27 24 18 12 9 3 0 ]
        [ 6 1 0 ]
        [ 6 4 3 1 0 ]
        [ 6 5 0 ]
        [ 6 5 2 1 0 ]
        [ 6 5 3 2 0 ]
        [ 6 5 4 1 0 ]

The irreducible but non-primitive binary polynomials are factors of cyclotomic polynomials $Y_d$ where
d | N, d < N and the order of 2 modulo d equals n:

    ? fordiv(N,d,if(n==znorder(Mod(2,d)) && (d<N), pcfprint(d) ));
    9: [ 6 3 0 ]
        [ 6 3 0 ]
    21: [ 12 11 9 8 6 4 3 1 0 ]
        [ 6 4 2 1 0 ]
        [ 6 5 4 2 0 ]




858 Chapter 40: Binary polynomials 


The number of factors of $Y_d$ equals $\varphi(d)/n$ so we can count how many degree-n irreducible polynomials
correspond to which divisor of $N = 2^n - 1$:

     1:  [1:1]  1
     2:  [3:1]  1
     3:  [7:2]  2
     4:  [5:1]  [15:2]  3
     5:  [31:6]  6
     6:  [9:1]  [21:2]  [63:6]  9
     7:  [127:18]  18
     8:  [17:2]  [51:4]  [85:8]  [255:16]  30
     9:  [73:8]  [511:48]  56
    10:  [11:1]  [33:2]  [93:6]  [341:30]  [1023:60]  99
    11:  [23:2]  [89:8]  [2047:176]  186


Line 6 tells us that one irreducible polynomial of degree 6 is due to the factor 9, two are due to the
factor 21, and the 6 primitive polynomials correspond to N = 63 itself, which we have verified a moment
ago. Further, the a polynomials corresponding to an entry [d:a] all have order d. The list was produced
using


    { for (n=1, 11,
        print1(n, ":  ");
        s = 0;
        N = 2^n-1;
        fordiv (N, d,
            if ( n==znorder(Mod(2,d)),
                a = eulerphi(d)/n;
                print1(" [", d, ":", a, "] ");
                s += a;
            );
        );
        print("  ", s);
    ); }
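
The same count can be cross-checked outside of GP. The following is a minimal C++ sketch, assuming small degrees (n <= 24) so that plain trial division is good enough; the function names order_of_two, euler_phi and count_irreducible are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Multiplicative order of 2 modulo the odd number d (taken as 1 for d==1).
static unsigned order_of_two(u64 d)
{
    if ( d == 1 )  return 1;
    unsigned k = 1;
    for (u64 p = 2 % d;  p != 1;  p = (2*p) % d)  ++k;
    return k;
}

// Euler's totient function by trial division.
static u64 euler_phi(u64 d)
{
    u64 r = d;
    for (u64 p = 2;  p*p <= d;  ++p)
    {
        if ( d % p )  continue;
        while ( d % p == 0 )  d /= p;
        r -= r / p;
    }
    if ( d > 1 )  r -= r / d;
    return r;
}

// Number of irreducible binary polynomials of degree n, summed over
// the divisors d of 2^n-1 for which the order of 2 modulo d equals n.
u64 count_irreducible(unsigned n)
{
    assert( n >= 1 && n <= 24 );  // keeps the divisor loop short
    const u64 N = (1ULL << n) - 1;
    u64 s = 0;
    for (u64 d = 1;  d <= N;  ++d)
    {
        if ( N % d )  continue;
        if ( order_of_two(d) == n )  s += euler_phi(d) / n;
    }
    return s;
}
```

The values returned for n = 6, 8 and 11 agree with the last column of the list above.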


40.12  Factorization of binary polynomials 


We give a method for the factorization of binary polynomials. The first part describes how to factorize 
polynomials that do not contain a square factor. The second part gives algorithms to detect and remove 
square factors. Finally, an algorithm to factorize arbitrary binary polynomials is given. 


40.12.1  Factorization of square-free polynomials 


A polynomial that does not contain a nontrivial square factor is called square-free. To factorize a
square-free polynomial, we will use Berlekamp's Q-matrix algorithm described in [46]. The algorithm consists
of two main steps: the computation of the nullspace of a matrix and a refinement phase that finds the
distinct irreducible factors.

Let c be a binary polynomial of degree d. The Q-matrix is a d x d matrix whose n-th column can be
computed as the binary polynomial $x^{2n} \bmod c$. The algorithm will use the nullspace of Q - id.


The routine to compute the Q-matrix is [FXT: bpol/berlekamp.cc]:

    void
    setup_q_matrix(ulong c, ulong d, ulong *ss)
    // Compute the Q-matrix for the degree-d polynomial c.
    // Used in Berlekamp's factorization algorithm.
    {
        ulong h = 1UL << (d-1);
        {
            ulong x2 = 2UL;  // == 'x'
            ulong q = 1UL;
            x2 = bitpolmod_mult(x2, x2, c, h);
            for (ulong k=0; k<d; ++k)
            {
                ss[k] = q;
                q = bitpolmod_mult(q, x2, c, h);
            }
        }
        bitmat_transpose(ss, d, ss);
    }

[Figure 40.12-A, three panels, each giving Q, Q-id and the nullspace: top c = x^7 + x + 1 (irreducible);
middle c = x^7 + x^3 + x + 1 = (x+1) (x^2+x+1) (x^4+x+1) (reducible but square-free); bottom c = (1+x)^7.]

Figure 40.12-A: The Q-matrices for three binary polynomials and the nullspaces of Q - id.


The Q-matrix and nullspace of Q - id for the irreducible binary polynomial c = x^7 + x + 1 are shown at
the top of figure 40.12-A. The vector $n_0 = [1, 0, \ldots, 0]$ lies in the nullspace of Q - id for every polynomial.
For c = x^7 + x^3 + x + 1 = (x+1) (x^2+x+1) (x^4+x+1) the nullspace has rank three (middle of
figure 40.12-A). The rank of the nullspace equals the number of distinct irreducible factors if c is
square-free. For polynomials containing a square factor we do not get the total number of factors; the data for
c = (1+x)^7 = x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x + 1 is shown at the bottom of figure 40.12-A. The figure
was created with the program [FXT: gf2n/qmatrix-demo.cc].
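
The rank statement can be checked with a small stand-alone sketch. It does not use the FXT routines: the names poly_deg and q_nullity are hypothetical, the matrix is stored by rows (row k holds x^(2k) mod c, which is the transpose of the convention above and has the same rank), and the rank is found by plain Gaussian elimination over GF(2) rather than by the nullspace algorithm of the library.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using u64 = std::uint64_t;

// Degree of the nonzero binary polynomial p.
static unsigned poly_deg(u64 p)
{
    unsigned d = 0;
    while ( p >>= 1 )  ++d;
    return d;
}

// Dimension of the nullspace of Q - id for the binary polynomial c.
unsigned q_nullity(u64 c)
{
    assert( c > 1 );  // degree at least one
    const unsigned d = poly_deg(c);
    const u64 h = 1ULL << (d-1);
    u64 row[64];
    u64 q = 1;  // x^(2k) mod c
    for (unsigned k=0; k<d; ++k)
    {
        row[k] = q ^ (1ULL << k);  // row k of Q - id
        for (int j=0; j<2; ++j)    // q *= x^2 (mod c)
        {
            const bool s = (q & h) != 0;
            q <<= 1;
            if ( s )  q ^= c;
        }
    }

    // Gaussian elimination over GF(2) to determine the rank.
    unsigned rank = 0;
    for (unsigned col=0; col<d; ++col)
    {
        unsigned i = rank;
        while ( i<d && !((row[i] >> col) & 1) )  ++i;
        if ( i == d )  continue;  // no pivot in this column
        std::swap(row[rank], row[i]);
        for (unsigned j=0; j<d; ++j)
            if ( j != rank && ((row[j] >> col) & 1) )  row[j] ^= row[rank];
        ++rank;
    }
    return d - rank;
}
```

For the three polynomials of figure 40.12-A (0x83, 0x8b and 0xff as bit words) the sketch gives the nullities 1, 3 and 1.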


To find the irreducible factors of c, the vectors spanning the nullspace must be post-processed. The
algorithm can be described as follows: let F be a set of binary polynomials whose product equals c. The
refinement step $R_i$ proceeds as follows:

Let t be the i-th element of the nullspace. For each element f in F do the following: if the degree of f
equals 1, keep it in the set, else remove f from the set and add from the set X = {gcd(f, t), gcd(f, t + 1)}
those elements whose degrees are greater than or equal to 1.

One starts with F = {c} and does the refinement steps $R_0, R_1, \ldots, R_{r-1}$ corresponding to the vectors of
the nullspace. Afterwards the set F will contain exactly the distinct irreducible factors of c. This is done
in the following routine [FXT: bpol/berlekamp.cc]:


    ulong
    bitpol_refine_factors(ulong *f, ulong nf, const ulong *nn, ulong r)
    // Given the nullspace nn[0,...,r-1] of (Q-id)
    // and nf factors f[0,...,nf-1] whose product equals c
    // (typically nf=1 and f[0]==c)
    // then get all r irreducible factors of c.
    {
        ulong ss[r];
        for (ulong j=0; j<r; ++j)  // for all elements t in nullspace
        {
            ulong t = nn[j];

            // Skip trivial elements in nullspace:
            if ( bitpol_deg(t)==0 )  continue;

            ulong sc = 0;
            for (ulong b=0; b<nf; ++b)  // for all elements bv in set
            {
                ulong bv = f[b];
                ulong db = bitpol_deg(bv);
                if ( db <= 1 )  // bv cannot be reduced
                {
                    ss[sc++] = bv;
                }
                else
                {
                    for (ulong s=0; s<2; ++s)  // for all elements in GF(2)
                    {
                        ulong ti = t ^ s;
                        ulong g = bitpol_gcd(bv, ti);
                        if ( bitpol_deg(g) >= 1 )  ss[sc++] = g;
                    }
                }
            }

            nf = sc;
            for (ulong k=0; k<nf; ++k)  f[k] = ss[k];

            if ( nf>=r )  break;  // done
        }

        return nf;
    }


We skip elements corresponding to constant polynomials. Further, as soon as the set F contains r 
elements, all factors are found and the algorithm terminates. 
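
A single splitting step can be illustrated in isolation. The sketch below is self-contained and does not call the FXT routines; poly_deg, poly_rem, poly_gcd and refine_split are hypothetical names. It assumes that t is a nonconstant element of the nullspace, that is, t^2 = t modulo f.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using u64 = std::uint64_t;

// Degree of p, or -1 for the zero polynomial.
static int poly_deg(u64 p)
{
    int d = -1;
    while ( p )  { ++d;  p >>= 1; }
    return d;
}

// Remainder of a modulo b (b nonzero).
static u64 poly_rem(u64 a, u64 b)
{
    const int db = poly_deg(b);
    for (int da = poly_deg(a);  da >= db;  da = poly_deg(a))
        a ^= b << (da - db);
    return a;
}

// Greatest common divisor by the Euclidean algorithm.
static u64 poly_gcd(u64 a, u64 b)
{
    while ( b )  { const u64 t = poly_rem(a, b);  a = b;  b = t; }
    return a;
}

// Split f with the nullspace element t into gcd(f,t) and gcd(f,t+1).
std::pair<u64,u64> refine_split(u64 f, u64 t)
{
    assert( f != 0 );
    return std::make_pair( poly_gcd(f, t), poly_gcd(f, t ^ 1) );
}
```

For f = x^3 + 1 the element t = x^2 + x satisfies t^2 = t modulo f, and the split gives the two irreducible factors x + 1 and x^2 + x + 1.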


Now Berlekamp's algorithm can be implemented as

    ulong
    bitpol_factor_squarefree(ulong c, ulong *f)
    // Fill irreducible factors of square-free polynomial c into f[]
    // Return number of factors.
    {
        ulong d = bitpol_deg(c);

        if ( d<=1 )  // trivial cases: 0, 1, x, x+1
        {
            f[0] = c;
            if ( 0==c )  d = 1;  // 0==0^1
            return d;
        }

        ulong ss[d];
        setup_q_matrix(c, d, ss);
        bitmat_add_unit(ss, d);

        ulong nn[d];
        ulong r = bitmat_nullspace(ss, d, nn);

        f[0] = c;
        ulong nf = 1;
        if ( r>1 )  nf = bitpol_refine_factors(f, nf, nn, r);

        return r;
    }


The algorithm for the computation of the nullspace was taken from [213]; find the implementation in
[FXT: bmat/bitmat-nullspace.cc].

Berlekamp's algorithm can be stated in a more general form: to factorize a polynomial with coefficients
in the finite field GF(q), set up the Q-matrix with columns $x^{q\,i}$ and set
$X = \{\gcd(f, t+0),\ \gcd(f, t+1),\ \ldots,\ \gcd(f, t+(q-1))\}$ in the refinement step. The algorithm is efficient only if q is small.


40.12.2 Extracting the square-free part of a polynomial 


To test whether a polynomial c has a square factor one computes g = gcd(c, c') where c' is the derivative.
If g is not 1, then c has the square factor g: let $c = a\,b^2$, then $c' = a'\,b^2 + a\,2\,b\,b' = a'\,b^2$
(the second term vanishes modulo 2), so $\gcd(c, c') = b^2$ for square-free a.
The corresponding routine for binary polynomials is given in [FXT: bpol/bitpol-squarefree.h]:

    inline ulong bitpol_test_squarefree(ulong c)
    // Return 0 if polynomial is square-free
    // else return square factor != 0
    {
        ulong d = bitpol_deriv(c);
        if ( 0==d )  return (1==c ? 0 : c);
        ulong g = bitpol_gcd(c, d);
        return (1==g ? 0 : g);
    }


The derivative of a binary polynomial can be computed easily [FXT: bpol/bitpol-deriv.h] (64-bit version):

    inline ulong bitpol_deriv(ulong c)
    // Return derivative of polynomial c
    {
        c &= 0xaaaaaaaaaaaaaaaaUL;
        return (c>>1);
    }

The coefficients at the even powers have to be cleared because derivation multiplies them with an even
factor which equals 0 modulo 2.


If the derivative of a binary polynomial is zero, then it is a perfect square or a constant polynomial:

    inline ulong bitpol_pure_square_q(ulong c)
    // Return whether polynomial is a pure square != 1
    {
        if ( 1UL==c )  return 0;
        c &= 0xaaaaaaaaaaaaaaaaUL;
        return (0==c);
    }


The following routine returns zero if c is square-free. If c has a square factor s that is not 1, then s is returned:

    inline ulong bitpol_test_squarefree(ulong c)
    {
        ulong d = bitpol_deriv(c);
        if ( 0==d )  return (1==c ? 0 : c);
        ulong g = bitpol_gcd(c, d);
        return (1==g ? 0 : g);
    }


If a polynomial is a perfect square, then its square root can be computed as

    inline ulong bitpol_pure_sqrt(ulong c)
    {
        ulong t = 0;
        for (ulong mc=1,mt=1;  mc;  mc<<=2,mt<<=1)
        {
            if ( mc & c )  t |= mt;
        }
        return t;
    }

A faster way to do the computation is to use the function bit_unzip0() from section 1.15 on page 38.
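
The idea behind the faster variant can be sketched without the library: the bits at the even positions are gathered into the low half of the word with a logarithmic number of shift-and-mask steps. The function name pure_sqrt_unzip is hypothetical, and the code is only an illustration of the bit-gathering technique for 64-bit words, not the FXT implementation of bit_unzip0().

```cpp
#include <cstdint>

// Gather the bits at even positions of c into the low half of the word.
// For a pure square c this is the square root of the binary polynomial.
std::uint64_t pure_sqrt_unzip(std::uint64_t c)
{
    std::uint64_t x = c & 0x5555555555555555ULL;  // keep even positions only
    x = (x | (x >> 1))  & 0x3333333333333333ULL;
    x = (x | (x >> 2))  & 0x0f0f0f0f0f0f0f0fULL;
    x = (x | (x >> 4))  & 0x00ff00ff00ff00ffULL;
    x = (x | (x >> 8))  & 0x0000ffff0000ffffULL;
    x = (x | (x >> 16)) & 0x00000000ffffffffULL;
    return x;
}
```

For example, the square x^6 + x^4 + x^2 + 1 (0x55) is mapped to x^3 + x^2 + x + 1 (0xf).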
For the factorization algorithm for general polynomials we have to extract the product of all distinct
irreducible factors (the square-free part) from a polynomial. The following routine returns a polynomial
where the even exponents in the factorization are reduced [FXT: bpol/bitpol-squarefree.cc]:

    ulong
    bitpol_sreduce(ulong c)
    {
        ulong s = bitpol_test_squarefree(c);
        if ( 0==s )  return c;  // c is square-free

        ulong f = bitpol_div(c, s);

        do  // here s is a pure square and s>1
        {
            s = bitpol_pure_sqrt(s);
        }
        while ( bitpol_pure_square_q(s) );

        ulong g = bitpol_gcd(s, f);
        s = bitpol_div(s, g);
        f = bitpol_mult(f, s);

        return f;
    }


With $c = f \cdot s^{k\,2^t}$ (k odd, the factors of f and s not necessarily distinct) the returned polynomial equals
$f \cdot s^k$. Some examples:

    a                  ->  a                          (40.12-1a)
    a^2                ->  a                          (40.12-1b)
    a^3 = a a^2        ->  a a                        (40.12-1c)
    a^5 = a a^4        ->  a a                        (40.12-1d)
    a b^2              ->  a b                        (40.12-1e)
    a b^3 = a b b^2    ->  a b b = a b^2  ->  a b     (40.12-1f)
    f s^(k 2^t)        ->  f s^k                      (40.12-1g)

To extract the square-free part of a polynomial call the routine repeatedly until the returned polynomial
equals the input:

    inline ulong bitpol_make_squarefree(ulong c)
    {
        ulong z = c, t;
        while ( z!=(t=bitpol_sreduce(z)) )  z = t;
        return z;
    }

The reduction routine will be called at most $\log_2(n)$ times for a polynomial of degree n: the worst case
is a perfect power $p = a^{2^k-1}$ where $2^k - 1 \le n$. Observe that $(2^k - 1) = 1 + 2\,(2^{k-1} - 1)$, so the reduction
routine will split p as $p = a\,s^2 \mapsto a\,s$ where $s = a^{2^{k-1}-1}$ is of the same form.


40.12.3 Factorization of arbitrary polynomials 


The factorization routine for arbitrary binary polynomials extracts the square-free part f of its input c, 
uses Berlekamp’s algorithm to factor f and updates the exponents according to the polynomial s = c/f. 


There is just one call to the routine that computes a nullspace [FXT: bpol/bitpol-factor.cc]:

    ulong
    bitpol_factor(ulong c, ulong *f, ulong *e)
    // Factorize the binary polynomial c:
    //  c = \prod_{i=0}^{fct-1}{f[i]^e[i]}
    // The number of factors (fct) is returned.
    {
        ulong d = bitpol_deg(c);
        if ( d<=1 )  // trivial cases: 0, 1, x, x+1
        {
            f[0] = c;
            if ( 0==c )  d = 1;  // 0==0^1
            return d;
        }

        // get square-free part:
        ulong cf = bitpol_make_squarefree(c);

        // ... and factor it:
        ulong fct = bitpol_factor_squarefree(cf, f);

        // All exponents are one:
        for (ulong j=0; j<fct; ++j)  { e[j] = 1; }

        // Here f[],e[] is a valid factorization of the square-free part cf

        // Update exponents with square part:
        ulong cs = bitpol_div(c, cf);
        for (ulong j=0; j<fct; ++j)
        {
            if ( 1==cs )  break;

            ulong fj = f[j];
            ulong g = bitpol_gcd(cs, fj);
            while ( 1!=g )
            {
                ++e[j];
                cs = bitpol_div(cs, fj);
                if ( 1==cs )  break;
                g = bitpol_gcd(cs, fj);
            }
        }

        return fct;
    }
     0:  x^5                  ==  (x)^5
     1:  x^5              +1  ==  (x+1) * (x^4+x^3+x^2+x+1)
     2:  x^5            +x    ==  (x) * (x+1)^4
     3:  x^5            +x+1  ==  (x^2+x+1) * (x^3+x^2+1)
     4:  x^5        +x^2      ==  (x)^2 * (x+1) * (x^2+x+1)
     5:  x^5        +x^2  +1  ==  (x^5+x^2+1)
     6:  x^5        +x^2+x    ==  (x) * (x^4+x+1)
     7:  x^5        +x^2+x+1  ==  (x+1)^2 * (x^3+x+1)
     8:  x^5    +x^3          ==  (x)^3 * (x+1)^2
     9:  x^5    +x^3      +1  ==  (x^5+x^3+1)
    10:  x^5    +x^3    +x    ==  (x) * (x^2+x+1)^2
    11:  x^5    +x^3    +x+1  ==  (x+1) * (x^4+x^3+1)
    12:  x^5    +x^3+x^2      ==  (x)^2 * (x^3+x+1)
    13:  x^5    +x^3+x^2  +1  ==  (x+1)^3 * (x^2+x+1)
    14:  x^5    +x^3+x^2+x    ==  (x) * (x+1) * (x^3+x^2+1)
    15:  x^5    +x^3+x^2+x+1  ==  (x^5+x^3+x^2+x+1)
    16:  x^5+x^4              ==  (x)^4 * (x+1)
    17:  x^5+x^4          +1  ==  (x^2+x+1) * (x^3+x+1)
    18:  x^5+x^4        +x    ==  (x) * (x^4+x^3+1)
    19:  x^5+x^4        +x+1  ==  (x+1)^5
    20:  x^5+x^4    +x^2      ==  (x)^2 * (x^3+x^2+1)
    21:  x^5+x^4    +x^2  +1  ==  (x+1) * (x^4+x+1)
    22:  x^5+x^4    +x^2+x    ==  (x) * (x+1)^2 * (x^2+x+1)
    23:  x^5+x^4    +x^2+x+1  ==  (x^5+x^4+x^2+x+1)
    24:  x^5+x^4+x^3          ==  (x)^3 * (x^2+x+1)
    25:  x^5+x^4+x^3      +1  ==  (x+1)^2 * (x^3+x^2+1)
    26:  x^5+x^4+x^3    +x    ==  (x) * (x+1) * (x^3+x+1)
    27:  x^5+x^4+x^3    +x+1  ==  (x^5+x^4+x^3+x+1)
    28:  x^5+x^4+x^3+x^2      ==  (x)^2 * (x+1)^3
    29:  x^5+x^4+x^3+x^2  +1  ==  (x^5+x^4+x^3+x^2+1)
    30:  x^5+x^4+x^3+x^2+x    ==  (x) * (x^4+x^3+x^2+x+1)
    31:  x^5+x^4+x^3+x^2+x+1  ==  (x+1) * (x^2+x+1)^2

Figure 40.12-B: Factorizations of the binary polynomials of degree 5.


Figure 40.12-B shows the factorizations of the binary polynomials of degree 5. It was created with the
program [FXT: gf2n/bitpolfactor-demo.cc]. Factoring the first million polynomials of degrees 20, 30, 40
and 60 takes about 5, 10, 15 and 30 seconds, respectively.

A variant of the factorization algorithm often given uses the square-free factorization $c = \prod_i a_i^i$ where
the polynomials $a_i$ are square-free and pair-wise coprime. Given the square-free factorization one has to
call the core routine for each nontrivial $a_i$.

As noted, the refinement step becomes expensive if the coefficients are in a field GF(q) where q is not
small because q computations of the polynomial gcd are involved. For an algorithm that is efficient also
for large values of q see [110] or ch.14]. A 'baby step/giant step' method is given in [308].
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Chapter 41 


Shift registers 


We describe shift register sequences (SRS) and their generation via linear feedback shift registers (LFSR).
The underlying mechanism is the modular arithmetic of binary polynomials treated in section 40.3.
We give an expression for the number of shift register sequences of maximal length, the m-sequences.
We look at two related mechanisms, feedback carry shift registers (FCSR) and linear hybrid cellular
automata (LHCA). Most of the algorithms given can easily be implemented in hardware. Among the
many applications for shift registers are random number generators, the computation of CRCs, spectrum
spreading with communication protocols, and hardware testing.


41.1 Linear feedback shift registers (LFSR) 


    c = 1..11 == 0x13 == 19  (deg = 4)
        0   w= 15 = 1111   1   ...1 =  1 =a
        1   w= 14 = 111.   .   ..1. =  2 =a
        2   w= 12 = 11..   .   .1.. =  4 =a
        3   w=  8 = 1...   .   1... =  8 =a
        4   w=  1 = ...1   1   ..11 =  3 =a
        5   w=  2 = ..1.   .   .11. =  6 =a
        6   w=  4 = .1..   .   11.. = 12 =a
        7   w=  9 = 1..1   1   1.11 = 11 =a
        8   w=  3 = ..11   1   .1.1 =  5 =a
        9   w=  6 = .11.   .   1.1. = 10 =a
       10   w= 13 = 11.1   1   .111 =  7 =a
       11   w= 10 = 1.1.   .   111. = 14 =a
       12   w=  5 = .1.1   1   1111 = 15 =a
       13   w= 11 = 1.11   1   11.1 = 13 =a
       14   w=  7 = .111   1   1..1 =  9 =a
    >>  0   w= 15 = 1111   1   ...1 =  1 =a   << new period starts
        1   w= 14 = 111.   .   ..1. =  2 =a
        2   w= 12 = 11..   .   .1.. =  4 =a
        3   w=  8 = 1...   .   1... =  8 =a
        4   w=  1 = ...1   1   ..11 =  3 =a
        5   w=  2 = ..1.   .   .11. =  6 =a

Figure 41.1-A: Linear feedback shift register using the primitive polynomial C = x^4 + x + 1.


Multiplication of a binary polynomial A by x modulo a polynomial C is particularly easy as shown near
the beginning of section 40.3 on page 832: shift the input to the left (multiplication); if the result A·x
has the same degree as C, then subtract (XOR) the polynomial C (modular reduction).

The underlying mechanism of shifting and conditionally feeding back certain bits is called a linear feedback
shift register (LFSR). A shift register sequence (SRS) can be generated by computing $A_k = x^k$,
$k = 0, 1, \ldots, 2^n - 1$ modulo C and setting bit k of the SRS to the least significant bit of $A_k$. In the context
of LFSRs the polynomial C is sometimes called the connection polynomial of the shift register.
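
As a minimal illustration of this definition (not the FXT implementation), the following sketch computes one period of the SRS as a string of digits. The function name shift_register_sequence is hypothetical, and the degree n is assumed to be small because the whole period of length 2^n - 1 is stored.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// One period (length 2^n-1) of the shift register sequence for the
// degree-n polynomial c: bit k is the lowest bit of x^k mod c.
std::string shift_register_sequence(std::uint64_t c, unsigned n)
{
    assert( n >= 1 && n < 64 );
    const std::uint64_t h = 1ULL << (n-1);
    std::uint64_t a = 1;  // x^0
    std::string srs;
    for (std::uint64_t k = 0;  k < (1ULL << n) - 1;  ++k)
    {
        srs += ( (a & 1) ? '1' : '0' );
        const bool s = (a & h) != 0;  // multiply by x modulo c
        a <<= 1;
        if ( s )  a ^= c;
    }
    return srs;
}
```

For C = x^4 + x + 1 (0x13) and n = 4 the returned string is 100010011010111.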


If the modulus C is a primitive polynomial (see section 40.5 on page 841) of degree n, then the SRS is a
sequence of zeros and ones that contains all nonzero words of length n. Further, if a word W is updated
at each step by left shifting and adding the bit of the SRS, this sequence also contains all nonzero words.

This is demonstrated in [FXT: gf2n/lfsr-demo.cc], which for n = 4 uses the primitive polynomial C =
x^4 + x + 1 and gives the output shown in figure 41.1-A. Here we pasted the first few lines after the end
of the actual output to emphasize the periodicity of the sequences. The corresponding SRS of period 15
is (extra spaces mark start of new periods):


    100010011010111  100010011010111  10...


In fact any of the bits of the words $A_k = x^k \bmod C$ (or a linear combination of two or more bits) could be
used, each producing a cyclically shifted version of the SRS.

An efficient way to generate an SRS is to compute the powers $x^{-k}$ modulo C, that is, to repeatedly divide
by x:

    ulong c = /* a primitive polynomial */;
    ulong n = /* degree of C */;
    ulong a = 1;
    for (ulong k=0; k<n; ++k)
    {
        ulong s = a & 1;
        // Use s here.
        if ( s )  a ^= c;
        a >>= 1;
    }


The routine will work for deg(C) < n where n is the number of bits in a machine word. A version that
also works for deg(C) = n is given in section 40.3 on page 832.

A C++ implementation of an LFSR is [FXT: class lfsr in bpol/lfsr.h]:

    class lfsr
    // (binary) Linear Feedback Shift Register
    // Produces a shift register sequence (SRS)
    // generated by  a_k=x^k (mod c)  where
    // c is a primitive polynomial of degree n.
    // The period of SRS is 2^n - 1
    // (non-primitive c lead to smaller periods)
    {
    public:
        ulong a_;  // internal state (polynomial modulo c)
        ulong w_;  // word of the shift register sequence (SRS)
        ulong c_;  // (mod 2) poly  e.g. x^4+x+1 == 0x13 == 1..11
        ulong h_;  // highest bit in SRS word  e.g. (above) == 8 == 1...
        ulong mask_;  // mask  e.g. (above) == 15 == 1111
        ulong n_;  // degree of polynomial  e.g. (above) == 4

    public:
        lfsr(ulong n, ulong c=0)
        // n: degree of polynomial c
        // c: polynomial (defaults to minimum weight primitive polynomial)
    [--snip--]


The crucial computation is implemented as

    ulong next()
    {
        ulong s = a_ & h_;
        a_ <<= 1;
        w_ <<= 1;
        if ( 0!=s )
        {
            a_ ^= c_;
            w_ |= 1;
        }
        w_ &= mask_;
        return w_;
    }

Up to the lines that update the word w_ this function is identical to bitpolmod_times_x() given in
section 40.3 on page 832.

The method next_w() skips to the next word by calling next() n times:

    ulong next_w()
    {
        for (ulong k=0; k<n_; ++k)  next();
        return w_;
    }


Let a and w be a pair of values that correspond to each other. The following two methods directly set one
of these two while keeping the pair consistent. This routine sets a to a given value:

    void set_a(ulong a)
    {
        a_ = a;
        w_ = 0;
        ulong b = 1;
        for (ulong j=0; j<n_; ++j)
        {
            if ( a & 1 )
            {
                w_ |= b;
                a ^= c_;
            }
            b <<= 1;
            a >>= 1;
        }
    }


The loop executes n times where n is the degree of the modulus. This routine sets w to a given value:

    void set_w(ulong w)
    {
        w_ = w;
        a_ = 0;
        ulong c = c_;
        while ( w )
        {
            if ( w & 1 )  a_ ^= c;
            c <<= 1;
            w >>= 1;
        }
        a_ &= mask_;
    }

The supplied value must be nonzero for both methods.


Going back one step is possible via the method prev()

    public:
        ulong prev()
        {
            prev_a();
            set_a(a_);
            return w_;
        }

which calls prev_a():

    private:
        void prev_a()
        {
            ulong s = a_ & 1;
            a_ >>= 1;
            if ( s )
            {
                a_ ^= (c_>>1);
                a_ |= h_;  // so it works for n_ == BITS_PER_LONG
            }
        }

The method prev_a() leaves the value of w inconsistent with a and therefore cannot be called directly.
Note that stepping back is more expensive than stepping forward because set_a() is rather expensive.


It is also possible to go backwards word-wise:

    ulong prev_w()
    {
        for (ulong k=0; k<n_; ++k)  prev_a();
        set_a(a_);
        return w_;
    }


As this routine involves only one call to set_a() it is about as expensive as stepping one word forward 
using next_w(). 


41.2 Galois and Fibonacci setup 


    c = .11..1 == 0x19 == 25  (deg = 4)
    r = .1..11 == 0x13 == 19  (deg = 4)
         --------- Galois ----------     -------- Fibonacci --------
     k:    Lc     Lr     Rc     Rr         Lc     Lr     Rc     Rr
     1:  ....1  ....1  ....1  ....1      ....1  ....1  ....1  ....1
     2:  ...1.  ...1.  .11..  .1..1      ...1.  ...11  .1...  .1...
     3:  ..1..  ..1..  ..11.  .11.1      ..1..  ..111  .11..  ..1..
     4:  .1...  .1...  ...11  .1111      .1..1  .1111  .111.  ...1.
     5:  .1..1  ...11  .11.1  .111.      ...11  .111.  .1111  .1..1
     6:  .1.11  ..11.  .1.1.  ..111      ..11.  .11.1  ..111  .11..
     7:  .1111  .11..  ..1.1  .1.1.      .11.1  .1.1.  .1.11  ..11.
     8:  ..111  .1.11  .111.  ..1.1      .1.1.  ..1.1  ..1.1  .1.11
     9:  .111.  ..1.1  ..111  .1.11      ..1.1  .1.11  .1.1.  ..1.1
    10:  ..1.1  .1.1.  .1111  .11..      .1.11  ..11.  .11.1  .1.1.
    11:  .1.1.  ..111  .1.11  ..11.      ..111  .11..  ..11.  .11.1
    12:  .11.1  .111.  .1..1  ...11      .1111  .1..1  ...11  .111.
    13:  ...11  .1111  .1...  .1...      .111.  ...1.  .1..1  .1111
    14:  ..11.  .11.1  ..1..  ..1..      .11..  ..1..  ..1..  ..111
    15:  .11..  .1..1  ...1.  ...1.      .1...  .1...  ...1.  ...11
    16:  ....1  ....1  ....1  ....1      ....1  ....1  ....1  ....1

Figure 41.2-A: Sequences of words generated with the Galois and Fibonacci mechanisms, either with
the left or the right shift (capital letters 'L' and 'R' on top of columns) and primitive polynomial 'c' or
its reciprocal 'r'. Each track of any sequence is a shift register sequence.


The type of shift registers considered so far is the Galois setup of a binary shift register. The mechanism
is to detect whether a one is being shifted out and, if so, subtract the polynomial modulus. The auxiliary
variable h must be the word where only bit n - 1 is set where n is the degree of the polynomial c. The
left and right shift operations can be implemented as

    ulong galois_left(ulong x, ulong c, ulong h)
    {
        ulong s = x & h;
        x <<= 1;
        if ( 0!=s )  x ^= c;
        return x;
    }

and

    ulong galois_right(ulong x, ulong c)
    {
        ulong s = ( x & 1UL );
        x >>= 1;
        if ( s )  x ^= (c>>1);
        return x;
    }


Four sequences of binary words that are generated with either the left or right shift and a primitive
polynomial or its reciprocal (reversed word) are shown in figure 41.2-A. A different set of sequences shown
in the same figure is obtained with the Fibonacci setup. In the Fibonacci setup the sum (modulo 2) of
bits determined by the used polynomial is shifted in at each step.


The left and right shift operations can be implemented as

    ulong fibonacci_left(ulong x, ulong c, ulong h)
    {
        x <<= 1;
        ulong s = parity( x & c );
        if ( 0!=s )  x ^= 1;
        x &= ~(h<<1);  // remove excess bit at high end
        return x;
    }

and

    ulong fibonacci_right(ulong x, ulong c, ulong h)
    {
        ulong s = parity( x & c );
        x >>= 1;
        if ( s )  x ^= h;
        return x;
    }

As the parity computation is expensive on most machines (see section 1.16.1 on page 42), the Galois
setup should usually be preferred. The programs [FXT: gf2n/lfsr-fibonacci-demo.cc] and [FXT: gf2n/lfsr-...]
can be used to create the binary patterns shown in figure 41.2-A. With both programs
the polynomial modulus can be specified.
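
Both setups can be compared with a short, self-contained sketch. It repeats the two left-shift routines with a portable (and slow) parity function and counts the steps until the start word 1 reappears. The template name period and the loop-based parity are assumptions made for this illustration; the sketch also assumes that the constant term of c is set, so that every step is invertible and the word 1 is reached again.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Parity of the number of set bits (portable, not optimized).
static unsigned parity(u64 x)
{
    unsigned p = 0;
    while ( x )  { p ^= 1;  x &= x - 1; }
    return p;
}

static u64 galois_left(u64 x, u64 c, u64 h)
{
    const u64 s = x & h;
    x <<= 1;
    if ( s )  x ^= c;
    return x;
}

static u64 fibonacci_left(u64 x, u64 c, u64 h)
{
    x <<= 1;
    if ( parity(x & c) )  x ^= 1;
    x &= ~(h << 1);  // remove excess bit at high end
    return x;
}

// Number of steps until the start word 1 reappears.
template <typename Step>
u64 period(Step step, u64 c, unsigned n)
{
    assert( n >= 1 && n < 63 );
    const u64 h = 1ULL << (n-1);
    u64 x = 1, k = 0;
    do  { x = step(x, c, h);  ++k; }  while ( x != 1 );
    return k;
}
```

With the primitive polynomials 0x19 and 0x13 of figure 41.2-A both setups have the maximal period 15, while the irreducible but non-primitive x^4 + x^3 + x^2 + x + 1 (0x1f) gives the period 5 in the Galois setup.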


41.3 Error detection by hashing: the CRC 


A hash value is an element from a set H that is computed via a hash function f that maps any (finite)
sequence of input data to H:

    $f : S \to H, \quad s \mapsto h$   (41.3-1)

where s is in S and h is in H. For the sake of simplicity we now consider input sequences of fixed size, so they
are in a fixed set S. We further assume that the set S is (much) bigger than H.


Input sequences with different hash values are necessarily different. But, as the hash function maps a 
bigger set to a smaller one, there are different input sequences with identical hash values. 


A trivial example is the set H = {0, 1} together with a function that counts binary digits modulo 2, the
parity function. Another example is the sum-of-digits test (see section 28.4 on page 562) that is used
to check the multiplication of large numbers. In the test we compute the value of a multi-digit decimal
number modulo 9, so H = {0, 1, 2, ..., 8}. The crucial additional property of this hash is that with
f(A) = a, f(B) = b, f(C) = c (where A, B, and C are decimal numbers), A·B = C implies a·b = c modulo 9.

To be useful, a hash function f should have the mixing property: it should map the elements s of S
'randomly' to H. With the sum-of-digits test we could have used rather arbitrary moduli for the hash
function. With one exception: the value modulo 10 as hash would be rather useless as no change in any
digit except for the last could ever be detected.
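
The sum-of-digits test can be written down in a few lines. The following sketch uses hypothetical names (hash9, product_plausible) and machine-word operands instead of multi-digit numbers, so it only illustrates the principle: a mismatch proves an error, a match does not prove correctness.

```cpp
#include <cstdint>

// Hash of a number: its value modulo 9 (equal to its digit sum modulo 9).
static unsigned hash9(std::uint64_t x)
{
    return static_cast<unsigned>(x % 9);
}

// Necessary (but not sufficient) condition for A*B == C.
bool product_plausible(std::uint64_t A, std::uint64_t B, std::uint64_t C)
{
    return (hash9(A) * hash9(B)) % 9 == hash9(C);
}
```

The correct product 123 · 456 = 56088 passes the test and the wrong result 56089 is rejected. The wrong results 56097 (off by 9) and 56808 (two digits swapped) pass as well, which shows that different inputs can share a hash value.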


The cyclic redundancy check (CRC) is a hash where the hash values are binary words of fixed length.
The hash function (basically) computes h = s mod c where s is the binary polynomial corresponding to
the input sequence and c is a binary polynomial that is primitive (see chapter 40 on page 822). We will
use polynomials c of degree 64 so the hash values (CRCs) are 64-bit words.


A C++ implementation is given as [FXT: class crc64 in bits/crc64.h]:

    class crc64
    // 64-bit CRC (cyclic redundancy check)
    {
    public:
        uint64 a_;  // internal state (polynomial modulo c)
        uint64 c_;  // a binary primitive polynomial
        // (non-primitive c lead to smaller periods)
        // The leading coefficient need not be present.
        uint64 h_;  // auxiliary

        static const uint64 cc[];  // 16 "random" 64-bit primitive polynomials

    public:
        crc64(uint64 c=0)
        {
            if ( 0==c )  c = 0x1bULL;  // =^= 64,4,3,1,0  (default)
            init(c);
        }

        ~crc64()  {;}

        void init(uint64 c)
        {
            c_ = c;
            c_ >>= 1;
            h_ = 1ULL<<63;
            c_ |= h_;  // leading coefficient
            reset();
        }

        void reset()  { set_a(~0ULL); }  // all ones
        void set_a(uint64 a)  { a_ = a; }
        uint64 get_a()  const  { return a_; }
    [--snip--]


Note that a nonzero initial state (member variable a_) is used: starting with zero will only go to a nonzero
state with the first nonzero bit in the input sequence. That is, input sequences differing only by initial
runs of zeros would get the same CRC.

Individual bits can be fed into a CRC using the method bit_in(b); the lowest bit of the argument b is
used:

    [--snip--]
        void shift()
        {
            bool s = (a_ & 1);
            a_ >>= 1;
            if ( 0!=s )  a_ ^= c_;
        }

        uint64 bit_in(unsigned char b)
        {
            a_ ^= (b&1);
            shift();
            return a_;
        }
    [--snip--]


For checksumming a byte, we can do better than just feeding in the bits one by one:

    [--snip--]
        uint64 byte_in(unsigned char b)
        {
    #if 1
            a_ ^= b;
            shift();  shift();  shift();  shift();
            shift();  shift();  shift();  shift();
    #else  // identical but slower:
            bit_in(b);  b>>=1;  // bit 0
            bit_in(b);  b>>=1;  // bit 1
            bit_in(b);  b>>=1;  // bit 2
            bit_in(b);  b>>=1;  // bit 3
            bit_in(b);  b>>=1;  // bit 4
            bit_in(b);  b>>=1;  // bit 5
            bit_in(b);  b>>=1;  // bit 6
            bit_in(b);  b>>=1;  // bit 7
    #endif
            return a_;
        }
    [--snip--]

The lower block implements the straightforward idea. The program [FXT: bits/crc64-demo.cc] computes
the 64-bit CRC of a single byte in both ways.
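
That both ways agree can be checked with a stand-alone sketch. The struct crc64_sketch below only mirrors the parts of the class shown above (all-ones start state, default polynomial 0x1b); it and the two helper functions crc_of_byte_slow and crc_of_byte_fast are hypothetical names for this illustration and not the FXT class itself.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

struct crc64_sketch
{
    u64 a;  // internal state
    u64 c;  // polynomial, shifted right, with leading coefficient set

    explicit crc64_sketch(u64 poly = 0x1b)
        : a(~0ULL), c((poly >> 1) | (1ULL << 63))  { }

    void shift()
    {
        const bool s = (a & 1) != 0;
        a >>= 1;
        if ( s )  a ^= c;
    }

    // Feed in the lowest bit of b.
    void bit_in(unsigned b)  { a ^= (b & 1);  shift(); }
};

// CRC of a single byte, bits fed in one by one.
u64 crc_of_byte_slow(unsigned char b)
{
    crc64_sketch crc;
    unsigned v = b;
    for (int k=0; k<8; ++k)  { crc.bit_in(v);  v >>= 1; }
    return crc.a;
}

// CRC of a single byte, one XOR followed by eight shifts.
u64 crc_of_byte_fast(unsigned char b)
{
    crc64_sketch crc;
    crc.a ^= b;
    for (int k=0; k<8; ++k)  crc.shift();
    return crc.a;
}
```

The two functions agree because the shift is linear over GF(2) and a bit at position k reaches position 0 after k shifts without triggering feedback on its way.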
Binary words are fed in byte by byte, starting from the lower end:

    uint64 word_in(uint64 w)
    {
        ulong k = BYTES_PER_LONG_LONG;
        while ( k-- )  { byte_in( (uchar)w );  w>>=8; }
        return a_;
    }

To feed in a given number of bits of a word, use the following method:

    uint64 bits_in(uint64 w, uchar k)
    // Feed in the k lowest bits of w
    {
        if ( k&1 )  { a_ ^= (w&1);  w >>= 1;  shift(); }
        k >>= 1;
        if ( k&1 )  { a_ ^= (w&3);  w >>= 2;  shift(); shift(); }
        k >>= 1;
        if ( k&1 )  { a_ ^= (w&15);  w >>= 4;  shift(); shift(); shift(); shift(); }
        k >>= 1;

        while ( k-- )  { byte_in( (uchar)w );  w>>=8; }

        return a_;
    }

The operation is the optimized equivalent to

    while ( k-- )  { bit_in( (uchar)w );  w>>=1; }


If two sequences differ in a single block of up to 64 bits, their CRCs will be different. The probability
that different sequences have the same CRC equals $2^{-64} \approx 5.42 \cdot 10^{-20}$. If that is not enough (and one
does not want to write a CRC with more than 64 bits), then we can use two (or more) instances where
different polynomials are used. Sixteen ‘random’ primitive polynomials are given [FXT: as 
static class member: 


    const uint64 crc64::cc[] = {
        0x5a0127dd34af1e81ULL,  // [0]
        0x4ef12e145d0e3ccdULL,  // [1]
        0x16503f45acce9345ULL,  // [2]
        0x24e8034491298b3fULL,  // [3]
        0x9e4a8ad2261db8b1ULL,  // [4]
        0xb199aecfbb17a13fULL,  // [5]
        0x3f1fa2cc0dfbbf51ULL,  // [6]
        0xfb6e45b2f694fb1fULL,  // [7]
        0xd4597140a01d32edULL,  // [8]
        0xbd08ba1a2d621bffULL,  // [9]
        0xae2b680542730db1ULL,  // [10]
        0x8ec06ec4a8fe8f6dULL,  // [11]
        0xb89a2ecea2233001ULL,  // [12]
        0x8b996e790b615ad1ULL,  // [13]
        0x7eaef8397265e1f9ULL,  // [14]
        0xf368ae22deecc7c3ULL,  // [15]
    };


These are taken from the list [FXT: data/rand64-hex-primpoly.txt]. Initialize multiple CRCs as follows:


crc64 crca( crc64::cc[0] ); 
crc64 crcb( crc64::cc[1] ); 


A class for 32-bit CRCs is given in [FXT: class crc32 in bits/crc32.h]. Its usage is completely equivalent.

The CRC can easily be implemented in hardware and is, for example, used to detect errors in hard disk
blocks. When a block is written, its CRC is computed and stored in an extra word. When the block is
read, the CRC is computed from the data and compared to the stored CRC. A mismatch indicates an
error.


One property that the CRC does not have is cryptographic security. It is possible to intentionally create 
a data set with a prescribed CRC. With secure hashes (like MD5 and SHA) it is (practically) not possible 
to do so. Secure hashes can be used to ‘sign’ data. Imagine you distribute a file (for example, a binary 
executable) over the Internet. You have to make sure that someone downloading the file (from any source) 
can verify that it is not an altered version (like, in the case of an executable, a malicious program). You 
create a (secure!) hash value which you publish on your web site. Any person can verify the authenticity 
of the file by computing the hash and comparing it to the published version. 


The cryptographic security of hash functions like MD5 and SHA is the object of ongoing research, see [344], 


[50], and BI]. 
41.3.1 Optimization via lookup tables


To feed an n-bit word w into the CRC in one step (instead of n steps), do as follows: Add w to (the
CRC word) a. Save the lowest n bits of the result to a variable x. Right shift a by n bits. Add to a the
entry x of an auxiliary table t. For n = 8 the operation can be implemented as [FXT: class tcrc64 in
bits/tcrc64.h]:

    uint64 byte_in(uchar b)
    {
        a_ ^= b;
        uint64 x = t_[a_ & 255];
        a_ >>= 8;
        a_ ^= x;
        return a_;
    }


The size of the table t is $2^n = 256$ words. For n = 1 the table would have only two entries, 0 and c, the
polynomial used. Then the implementation reduces to

    uint64 bit_in(uchar b)
    {
        a_ ^= (b&1);
        bool s = (a_ & 1);
        a_ >>= 1;
        if ( 0!=s )  a_ ^= c_;  // t[0]=0;  t[1]=c_;
        return a_;
    }


which is equivalent to the bit_in() routine of the unoptimized CRC. 


The lookup table is computed upon initialization:

    for (ulong w=0; w<256; ++w)
    {
        set_a(0);
        for (ulong k=0; k<8; ++k)  bit_in( (uchar)(w>>k) );
        t_[w] = a_;
    }


The class can use tables of either 16 or 256 words. If a table of size 16 is used, the computation is about 
six times faster than with the non-optimized routine. A table of size 256 gives a speedup by a factor of 
twelve. Optimization techniques based on lookup tables are often used in practical applications, both in 
hardware and in software, see [73]. 
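The table method can be compared against the shift-based computation in a short sketch. The names mini_tcrc64 and table_matches_bitwise() are hypothetical, the initial state of all ones is an assumption, and the test bytes come from a simple linear congruential generator chosen only for this illustration.

```cpp
#include <cstdint>

// Hypothetical table-driven CRC register with a 256-entry table.
struct mini_tcrc64
{
    uint64_t a_;       // CRC state
    uint64_t c_;       // polynomial
    uint64_t t_[256];  // t_[x] = state reached from x after eight shifts

    explicit mini_tcrc64(uint64_t c) : a_(~0ULL), c_(c)
    {
        for (unsigned w=0; w<256; ++w)
        {
            uint64_t a = w;
            for (int k=0; k<8; ++k)  a = shifted(a);
            t_[w] = a;
        }
    }

    // one shift step with feedback
    uint64_t shifted(uint64_t a) const
    { return (a >> 1) ^ ((a & 1) ? c_ : 0); }

    // reference: add the byte, then eight single shifts
    uint64_t byte_in_bitwise(unsigned char b)
    {
        a_ ^= b;
        for (int k=0; k<8; ++k)  a_ = shifted(a_);
        return a_;
    }

    // one table lookup replaces the eight shifts
    uint64_t byte_in_table(unsigned char b)
    {
        a_ ^= b;
        const uint64_t x = t_[a_ & 255];
        a_ >>= 8;
        a_ ^= x;
        return a_;
    }
};

// Feed n pseudo-random bytes into both variants and compare every state.
inline bool table_matches_bitwise(uint64_t c, unsigned n)
{
    mini_tcrc64 A(c), B(c);
    uint32_t s = 1;
    for (unsigned k=0; k<n; ++k)
    {
        s = s * 1664525u + 1013904223u;
        const unsigned char b = static_cast<unsigned char>(s >> 24);
        if ( A.byte_in_bitwise(b) != B.byte_in_table(b) )  return false;
    }
    return true;
}
```

The table entry depends only on the eight lowest bits of the state because the upper bits are merely moved down by eight positions; this is the reason the lookup and the shifts can be separated.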


41.3.2 Parallel CRCs 


A very fast method for checksumming is to compute the CRCs for each bit of the fed-in words in parallel.
An array of 64 words is used [FXT: class pcrc64 in bits/pcrc64.h]:

    template <typename Type>
    class pcrc64
    // Parallel computation of 64-bit CRCs for each bit of the input words.
    // Primitive polynomial used is x^64 + x^4 + x^3 + x^2 + 1
    {
    public:
        Type x_[64];  // CRC data
        // bit(i) of x_[0], x_[1], ..., x_[63] is a 64-bit CRC
        //   of bit(i) of all input words
        uint pos_;  // position of constant polynomial term
        const uint m_;  // mask to compute mod 64

At initialization all words are set to all ones:

    public:
        pcrc64()
            : m_(63)
        { reset(); }

        ~pcrc64()  { ; }

        void reset()
        {
            pos_ = 0;
            Type ff = 0;  ff = ~ff;
            for (uint k=0; k<64; ++k)  x_[k] = ff;
        }


The cyclic shift of the array is avoided by working modulo 64 when feeding in words:

    void word_in(Type w)
    {
        uint p = pos_;
        pos_ = (p+1) & m_;
        uint h = (p-1) & m_;
        Type a = x_[p & m_];  // 0
        p += 2;
        a ^= x_[p & m_];  // 2
        ++p;
        a ^= x_[p & m_];  // 3
        ++p;
        a ^= x_[p & m_];  // 4
        x_[h] = a ^ w;
    }


The algorithm corresponds to the Fibonacci setup of the linear feedback shift registers (see section
[...] on page 567). There is no primitive trinomial with degree a multiple of 8, so we use the pentanomial
$x^{64} + x^4 + x^3 + x^2 + 1$. With an array size where a primitive trinomial exists the modulo computations
would be more expensive. An unrolled routine can be used to feed in multiple words:


    void words_in(Type *w, ulong n)
    {
        if ( n&1 )  { word_in(w[0]);  ++w; }
        n >>= 1;

        if ( n&1 )  { word_in(w[0]);  word_in(w[1]);  w+=2; }
        n >>= 1;

        for (ulong k=0; k<n; ++k)
        {
            word_in(w[0]);
            word_in(w[1]);
            word_in(w[2]);
            word_in(w[3]);
            w += 4;
        }
    }


The program [FXT: bits/pcrc64-demo.cc] feeds the numbers up to a given value into a pcrc64<uint>:

    int main()
    {
        Type n = 32768;
        pcrc64<Type> P;
        for (Type k=0; k<n; ++k)  P.word_in(k);

        // print array P.x_[] here
    }

This rather untypical type of input data illustrates the independence of the bits in the array x_[].

The implementation can process about 2 GB of data per second when 64-bit types are used, 1 GB/s with
32-bit types, 500 MB/s with 16-bit types, and about 230 MB/s with 8-bit types.
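The independence of the bit lanes can be made concrete with a sketch that runs one 16-bit parallel CRC next to sixteen single-lane instances. The names mini_pcrc64 and pcrc_lanes_are_independent() are hypothetical, and the pseudo-random input words are an assumption of this example.

```cpp
#include <cstdint>

// Hypothetical re-statement of the parallel CRC shown above.
template <typename Type>
struct mini_pcrc64
{
    Type x_[64];    // CRC data, one 64-bit CRC per bit lane
    unsigned pos_;  // position of constant polynomial term

    mini_pcrc64() : pos_(0)
    {
        for (unsigned k=0; k<64; ++k)
            x_[k] = static_cast<Type>(~static_cast<Type>(0));
    }

    void word_in(Type w)
    {
        const unsigned p = pos_;
        pos_ = (p+1) & 63;
        const unsigned h = (p-1) & 63;
        Type a = x_[p];
        a = static_cast<Type>(a ^ x_[(p+2) & 63] ^ x_[(p+3) & 63] ^ x_[(p+4) & 63]);
        x_[h] = static_cast<Type>(a ^ w);
    }
};

// Lane i of a 16-bit parallel CRC must equal the CRC of bit i alone.
inline bool pcrc_lanes_are_independent(unsigned n)
{
    mini_pcrc64<uint16_t> P;
    mini_pcrc64<uint8_t> Q[16];  // only bit 0 of each Q[i] is inspected
    uint32_t s = 12345;
    for (unsigned k=0; k<n; ++k)
    {
        s = s * 1664525u + 1013904223u;
        const uint16_t w = static_cast<uint16_t>(s >> 16);
        P.word_in(w);
        for (unsigned i=0; i<16; ++i)
            Q[i].word_in( static_cast<uint8_t>((w >> i) & 1) );
    }
    for (unsigned i=0; i<16; ++i)
        for (unsigned j=0; j<64; ++j)
            if ( ((P.x_[j] >> i) & 1u) != (Q[i].x_[j] & 1u) )  return false;
    return true;
}
```

Only XOR operations act on the words, so no information moves between bit positions.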


41.4 Generating all revbin pairs 


    polynomial:   c = 1.1111     cr = 1111.1

      k:      x      xr           k:      x      xr
      1:   ....1   1....         17:   11...   ...11
      2:   1.111   111.1         18:   .11..   ..11.
      3:   111..   ..111         19:   ..11.   .11..
      4:   .111.   .111.         20:   ...11   11...
      5:   ..111   111..         21:   1.11.   .11.1
      6:   1.1..   ..1.1         22:   .1.11   11.1.
      7:   .1.1.   .1.1.         23:   1..1.   .1..1
      8:   ..1.1   1.1..         24:   .1..1   1..1.
      9:   1.1.1   1.1.1         25:   1..11   11..1
     10:   111.1   1.111         26:   1111.   .1111
     11:   11..1   1..11         27:   .1111   1111.
     12:   11.11   11.11         28:   1....   ....1
     13:   11.1.   .1.11         29:   .1...   ...1.
     14:   .11.1   1.11.         30:   ..1..   ..1..
     15:   1...1   1...1         31:   ...1.   .1...
     16:   11111   11111         32:   ....1   1....   <--= initial pair

Figure 41.4-A: All nonzero 5-bit revbin pairs generated by an LFSR.


With a primitive polynomial of degree n and its reverse we can generate all nonzero pairs x and
revbin(x,n) as follows [FXT: gf2n/lfsr-revbin-demo.cc]:

    inline void revbin_next(ulong &x, ulong c, ulong &xr, ulong cr)
    // if x and xr are (nonzero) n-bit words that are a revbin pair
    // compute the next revbin pair.
    // c must be a primitive polynomial, cr its reverse (the reciprocal polynomial).
    {
        ulong s = (x & 1UL);
        x >>= 1;
        xr <<= 1;
        if ( s )
        {
            x ^= (c>>1);
            xr ^= (cr);
        }
    }

An equivalent technique for computing the revbin permutation (see section 2.6 on page 118) has been
proposed in [264]. Figure 41.4-A shows all nonzero 5-bit revbin pairs generated with the primitive
polynomial $c = x^5 + x^3 + x^2 + x + 1$ and its reverse.
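A run through the full cycle can be sketched as follows. The helper names revbin_naive() and revbin_cycle_length() are hypothetical; the polynomial in the usage note is the one of figure 41.4-A, written as the word 1.1111.

```cpp
typedef unsigned long ulong;

inline void revbin_next(ulong &x, ulong c, ulong &xr, ulong cr)
{
    const ulong s = (x & 1UL);
    x >>= 1;
    xr <<= 1;
    if ( s )
    {
        x ^= (c >> 1);
        xr ^= cr;
    }
}

// Reverse the n lowest bits of x (simple loop, for checking only).
inline ulong revbin_naive(ulong x, ulong n)
{
    ulong r = 0;
    for (ulong k=0; k<n; ++k)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Step through the pairs starting at x=1 until x=1 recurs.
// Returns the cycle length, or 0 if some pair is not a revbin pair
// or the cycle does not close within 2^n steps.
inline ulong revbin_cycle_length(ulong c, ulong n)
{
    const ulong cr = revbin_naive(c, n+1);  // reciprocal polynomial
    ulong x = 1,  xr = 1UL << (n-1);
    for (ulong ct=1; ct < (1UL << n); ++ct)
    {
        revbin_next(x, c, xr, cr);
        if ( xr != revbin_naive(x, n) )  return 0;
        if ( x == 1 )  return ct;
    }
    return 0;
}
```

With c = 1.1111 (decimal 47) and n = 5 the cycle has length 31, so every nonzero 5-bit word occurs exactly once, as in the figure.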


41.5 The number of m-sequences and De Bruijn sequences 


A shift register sequence generated with a polynomial of degree n is of maximal length if the polynomial
is primitive. The corresponding shift register sequences are called m-sequences.

We now consider all sequences that are cyclic shifts of each other as the same sequence. For given n there
are as many m-sequences as primitive polynomials ($P_n = \varphi(2^n - 1)/n$, see section 40.6 on page 843).
These can be generated using the linear feedback shift registers described in section 41.1 on page 864.
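The count of primitive polynomials is easy to evaluate for small n. The following sketch uses trial division for the totient function; the function names are hypothetical.

```cpp
#include <cstdint>

// Euler's totient function by trial division (adequate for small arguments).
inline uint64_t euler_phi(uint64_t m)
{
    uint64_t phi = m;
    for (uint64_t d=2; d*d<=m; ++d)
    {
        if ( m % d == 0 )
        {
            while ( m % d == 0 )  m /= d;
            phi -= phi / d;
        }
    }
    if ( m > 1 )  phi -= phi / m;
    return phi;
}

// Number of primitive binary polynomials of degree n, for 1 <= n <= 63.
inline uint64_t num_primitive_polys(unsigned n)
{
    return euler_phi((1ULL << n) - 1) / n;
}
```

For n = 2, 3, 4, 5, 6 the values are 1, 2, 2, 6, 6.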


One might suspect that using the powers of other elements than x might lead to additional m-sequences, 
but this is not the case. Further, the powers of elements of maximal order modulo irreducible non- 
primitive polynomials do not give additional m-sequences. 


The program [FXT: gf2n/all-primpoly-srs-demo.cc] computes all m-sequences for a given n. The output
for n = 2, 3, 4, 5, 6 is shown in figure 41.5-A.

[Listing: for each degree n = 2, ..., 6 and each primitive polynomial c of that degree, the m-sequence of
length $2^n - 1$. Polynomials shown: degree 2: c=111; degree 3: c=1.11, c=11.1; degree 4: c=1..11, c=11..1;
degree 5: c=1..1.1, c=1111.1, c=11.111, c=1.1111, c=111.11, c=1.1..1; degree 6: six polynomials.]

Figure 41.5-A: All m-sequences for n = 2, 3, 4, 5, and 6. Dots denote zeros.


[Listing: all De Bruijn sequences for n = 4 (length 16), total number of DBS found = 16; the first and last
few De Bruijn sequences for n = 5 (length 32), total number of DBS found = 2048.]

Figure 41.5-B: All De Bruijn sequences for n = 4 (left), the first and last sequence correspond to the
two m-sequences for n = 4. The first and last few De Bruijn sequences for n = 5 (right).


If a zero is inserted into the (unique) run of n - 1 zeros in an m-sequence, then a De Bruijn sequence (DBS)
is obtained. A DBS contains all binary words including the all-zero word.

For all $n \ge 4$ there exist more DBSs than m-sequences. For example, for n = 6 there are 6 m-sequences
and 67,108,864 DBSs. An exhaustive search for all DBSs of given length $L = 2^n$ is possible
only for tiny n. The program [FXT: bits/all-dbs-demo.cc] finds all DBSs for n = 3, 4, 5. Its output with
n = 4 and n = 5 (partly) is shown in figure 41.5-B.

The total number of DBSs equals

$$ S_n = 2^x \quad \text{where} \quad x = 2^{n-1} - n \tag{41.5-1} $$

The two DBSs for n = 3 are [...111.1] and [...1.111]=[1.111...]; reversed sequences are considered
different by the formula. The first few values of $S_n$ are shown in figure 41.5-C. The sequence is entry
A016031 in [312]. We have $S_{n+1} = S_n^2 \, L_n / 2$, equivalently $x_{n+1} = 2\,x_n + n - 1$.

The general formula for the number of length-n base-m DBSs is $S_n = m!^{\,m^{n-1}} / m^n$, as given in
sect.7.2.1.1. A graph theoretical proof of the formula for m = 2 can be found in [234, p.56], see
also [358, entry "De Bruijn graph"]. For a more efficient approach to generate all DBSs of given length
see section 20.2.2 on page 395.




     n:   L_n = 2^n   x = L_n/2 - n   S_n = 2^x
     1:        2            0         1
     2:        4            0         1
     3:        8            1         2
     4:       16            4         16
     5:       32           11         2,048
     6:       64           26         67,108,864
     7:      128           57         144,115,188,075,855,872
     8:      256          120         1,329,227,995,784,915,872,903,807,060,280,344,576

Figure 41.5-C: The number $S_n$ of DBSs of length $L_n$.
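For tiny n the formula can be compared with an exhaustive count. The sketch below counts Hamiltonian cycles in the De Bruijn graph on n-bit words, starting at the all-zero word so that each cyclic sequence is counted once. It is a plain depth-first search, not the method of the demo program; all function names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Depth-first search over n-bit windows; 'node' is the current window,
// 'visited' the number of windows used so far.
inline uint64_t dbs_dfs(unsigned n, uint32_t node, uint32_t visited,
                        std::vector<bool> &seen)
{
    const uint32_t N = 1u << n,  mask = N - 1;
    if ( visited == N )  // all windows used: cycle must close at word zero
        return ( ((node << 1) & mask) == 0 ) ? 1 : 0;

    uint64_t ct = 0;
    for (uint32_t b=0; b<2; ++b)
    {
        const uint32_t nx = ((node << 1) | b) & mask;
        if ( seen[nx] )  continue;
        seen[nx] = true;
        ct += dbs_dfs(n, nx, visited+1, seen);
        seen[nx] = false;
    }
    return ct;
}

// Exhaustive count of binary De Bruijn sequences of length 2^n (tiny n only).
inline uint64_t count_dbs_brute(unsigned n)
{
    std::vector<bool> seen(1u << n, false);
    seen[0] = true;
    return dbs_dfs(n, 0, 1, seen);
}

// S_n = 2^(2^(n-1) - n); the result fits into 64 bits for 1 <= n <= 7.
inline uint64_t count_dbs_formula(unsigned n)
{
    return 1ULL << ((1ULL << (n-1)) - n);
}
```

The search is practical up to about n = 5; beyond that the formula is the only reasonable way to obtain the count.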


We note that there are several ways to generalize the idea of the De Bruijn cycles to universal cycles for 
combinatorial objects as described in [103]. An explicit construction for universal cycles for permutations 
is given in [293]. The problem of finding a rectangular pattern such that all different patterns of given 
size appear is discussed in [188]. De Bruijn sequences for words with forbidden substrings are discussed 


in [254]. 


41.6 Auto-correlation of m-sequences 


     1:  [ .1..11.1.1111... ]  C = +15 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1   <--=
     2:  [ .1..1111.1.11... ]  C = +15 -1 -1 -1 -1 +3 -5 -1 -1 -5 +3 -1 -1 -1 -1
     3:  [ .1.1..11.1111... ]  C = +15 -1 -1 -1 -1 -1 +3 -5 -5 +3 -1 -1 -1 -1 -1
     4:  [ .1.1..1111.11... ]  C = +15 -1 -1 -1 -1 +3 -1 -5 -5 -1 +3 -1 -1 -1 -1
     5:  [ .1.11..1111.1... ]  C = +15 -1 -1 -1 -1 -5 +3 -1 -1 +3 -5 -1 -1 -1 -1
     6:  [ .1.11.1..1111... ]  C = +15 -1 -1 -1 -9 -1 +3 +3 +3 +3 -1 -9 -1 -1 -1
     7:  [ .1.1111..11.1... ]  C = +15 -1 -1 -1 -1 -5 +3 -1 -1 +3 -5 -1 -1 -1 -1
     8:  [ .1.1111.1..11... ]  C = +15 -1 -1 -1 -1 -1 -5 +3 +3 -5 -1 -1 -1 -1 -1
     9:  [ .11..1.1111.1... ]  C = +15 -1 -1 -1 -1 -1 -5 +3 +3 -5 -1 -1 -1 -1 -1
    10:  [ .11.1..1.1111... ]  C = +15 -1 -1 -1 -9 +3 -1 +3 +3 -1 +3 -9 -1 -1 -1
    11:  [ .11.1.1111..1... ]  C = +15 -1 -1 -1 -1 +3 -5 -1 -1 -5 +3 -1 -1 -1 -1
    12:  [ .11.1111..1.1... ]  C = +15 -1 -1 -1 -1 +3 -1 -5 -5 -1 +3 -1 -1 -1 -1
    13:  [ .1111..1.11.1... ]  C = +15 -1 -1 -1 -9 -1 +3 +3 +3 +3 -1 -9 -1 -1 -1
    14:  [ .1111.1..1.11... ]  C = +15 -1 -1 -1 -9 +3 -1 +3 +3 -1 +3 -9 -1 -1 -1
    15:  [ .1111.1.11..1... ]  C = +15 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1   <--=
    16:  [ .1111.11..1.1... ]  C = +15 -1 -1 -1 -1 -1 +3 -5 -5 +3 -1 -1 -1 -1 -1

Figure 41.6-A: Cyclic auto-correlations for all truncated De Bruijn sequences for n = 4. Only the first and
the second last are m-sequences.


We have seen that a De Bruijn sequence (DBS) can be obtained from an m-sequence by inserting a single
zero at the longest run of zeros. In the other direction, if we take a DBS and delete a zero from the
longest run of zeros, then we get a sequence of length $N - 1 = 2^n - 1$ that contains every nonzero n-bit word.
But these sequences are not m-sequences in general: most of them cannot be generated with an n-bit
LFSR and lack an important property of m-sequences.

For a sequence M of N - 1 zeros and ones define the sequence S via

$$ S_k := \begin{cases} +1 & \text{if } M_k = 1 \\ -1 & \text{otherwise} \end{cases} \tag{41.6-1} $$

Then, if M is a length-L m-sequence, we have for the cyclic auto-correlation (or auto-correlation function,
ACF) of S

$$ C_\tau := \sum_{k=0}^{L-1} S_k \, S_{k+\tau \bmod L} = \begin{cases} L & \text{if } \tau = 0 \\ -1 & \text{otherwise} \end{cases} \tag{41.6-2} $$

where $L = N - 1$ and $N = 2^n$. That is, $C_0$ equals the length of the sequence, all other entries are of
minimal absolute value: they cannot be zero because L is odd.
This property does not hold for most of the 'truncated' DBS (where one zero in the single run of n
consecutive zeros is removed). Figure 41.6-A shows all (signed) truncated DBS for n = 4 and their
auto-correlations. Only 2 out of the 16 truncated DBS have an auto-correlation satisfying relation 41.6-2;
these are exactly the m-sequences for n = 4.
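The auto-correlation can be computed directly from its definition. In the sketch below sequences are given as strings with '1' for ones and '.' for zeros; the function names are hypothetical. The two strings in the usage note are entries 1 and 6 of figure 41.6-A with the leading zero removed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Cyclic auto-correlation of the sequence S with S_k = +1 where m[k]=='1'
// and S_k = -1 otherwise.
inline std::vector<int> cyclic_acf(const std::string &m)
{
    const std::size_t L = m.size();
    std::vector<int> s(L), c(L);
    for (std::size_t k=0; k<L; ++k)  s[k] = ( m[k]=='1' ? +1 : -1 );
    for (std::size_t t=0; t<L; ++t)
    {
        int sum = 0;
        for (std::size_t k=0; k<L; ++k)  sum += s[k] * s[(k+t) % L];
        c[t] = sum;
    }
    return c;
}

// True if c[0] equals the length and all other entries equal -1.
inline bool acf_is_two_valued(const std::vector<int> &c)
{
    if ( c.empty() || c[0] != static_cast<int>(c.size()) )  return false;
    for (std::size_t t=1; t<c.size(); ++t)
        if ( c[t] != -1 )  return false;
    return true;
}
```

For "1..11.1.1111..." (an m-sequence) all entries except the first are -1, while "1.11.1..1111..." has the value -9 at shift 4.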


For every odd prime q there are sequences of length L = q whose ACF satisfies

$$ \sum_{k=0}^{L-1} S_k \, S_{k+\tau \bmod L} = \begin{cases} L-1 & \text{if } \tau = 0 \\ -1 & \text{otherwise} \end{cases} \tag{41.6-3} $$

The sequences start with a single zero: set $S_0 = 0$, and for $1 \le k < q$ set $S_k = +1$ if k is a square
modulo q, else $S_k = -1$. A method to determine whether a number is a square modulo a prime is given
in section [...] on page [...]. The first three such sequences for primes of the form 4k + 1 and their ACFs
are:

     5:  S: [ 0, +1, -1, -1, +1]
         C: [ 4, -1, -1, -1, -1]

    13:  S: [ 0, +1, -1, +1, +1, -1, -1, -1, -1, +1, +1, -1, +1]
         C: [12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]

    17:  S: [ 0, +1, +1, -1, +1, -1, -1, -1, +1, +1, -1, -1, -1, +1, -1, +1, +1]
         C: [16, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]

With primes of the form q = 4k + 3 we can construct a sequence of length L = q satisfying relation 41.6-2
by simply setting $S_0 = 1$ ($S_0 = -1$ also works) in the sequence just constructed.
The sequences for the first three primes q = 4k + 3 and their ACFs are:

     3:  S: [+1, +1, -1]
         C: [ 3, -1, -1]

     7:  S: [+1, +1, +1, -1, +1, -1, -1]
         C: [ 7, -1, -1, -1, -1, -1, -1]

    11:  S: [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]
         C: [11, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]

These sequences can be used for the construction of Hadamard matrices, see chapter 19 on page 384.
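The construction can be sketched in a few lines. Squares modulo q are found here by simply squaring all residues, which is adequate for small q and is not the method the text refers to; the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sequence with S_0 = s0 and, for 1 <= k < q, S_k = +1 if k is a square
// modulo the odd prime q, else S_k = -1.
inline std::vector<int> legendre_sequence(unsigned q, int s0)
{
    std::vector<int> s(q, -1);
    for (unsigned k=1; k<q; ++k)  s[(k*k) % q] = +1;
    s[0] = s0;
    return s;
}

// Cyclic auto-correlation of an integer sequence.
inline std::vector<int> cyclic_acf(const std::vector<int> &s)
{
    const std::size_t L = s.size();
    std::vector<int> c(L);
    for (std::size_t t=0; t<L; ++t)
    {
        int sum = 0;
        for (std::size_t k=0; k<L; ++k)  sum += s[k] * s[(k+t) % L];
        c[t] = sum;
    }
    return c;
}
```

Calling cyclic_acf(legendre_sequence(5, 0)) reproduces the ACF listed for q = 5, and cyclic_acf(legendre_sequence(7, +1)) the one for q = 7.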


41.7 Feedback carry shift registers (FCSR) 


There is an analogue of the LFSR in the modulo world, the feedback carry shift register (FCSR). With
the LFSR we needed an irreducible ('prime') polynomial C where x has maximal order. The powers of x
modulo C ran through all different (nonzero) words. Now take a prime c where 2 has maximal order
(that is, 2 is a primitive root modulo c, see section 39.5 on page 774). Then the powers of 2 modulo c run
through all nonzero values less than c.


An implementation of an FCSR is [FXT: class fcsr in bpol/fcsr.h]:

    class fcsr
    {
    public:
        ulong a_;     // internal state (a_0*2**k modulo c), 1 <= a < c
        ulong w_;     // word of the SRS, 1 <= w <= mask
        ulong c_;     // a prime with primitive root 2, e.g. 37 = 1..1.1
        ulong mask_;  // mask e.g. (with above) mask == 63 == 111111

    public:
        fcsr(ulong c)
        {
            c_ = c;
            const ulong h = highest_one(c_);
            mask_ = ( h | (h-1) );
            set_a(1);
        }

        ~fcsr()  { ; }

        ulong next()
        {
            a_ <<= 1;  // a *= 2


    c = 1..1.1 = 37
     0 :  a= .....1 =  1    w= .1..11 = 19
     1 :  a= ....1. =  2    w= 1..11. = 38
     2 :  a= ...1.. =  4    w= ..11.. = 12
     3 :  a= ..1... =  8    w= .11... = 24
     4 :  a= .1.... = 16    w= 11.... = 48
     5 :  a= 1..... = 32    w= 1..... = 32
     6 :  a= .11.11 = 27    w= .....1 =  1
     7 :  a= .1...1 = 17    w= ....11 =  3
     8 :  a= 1...1. = 34    w= ...11. =  6
     9 :  a= .11111 = 31    w= ..11.1 = 13
    10 :  a= .11..1 = 25    w= .11.11 = 27
    11 :  a= ..11.1 = 13    w= 11.111 = 55
    12 :  a= .11.1. = 26    w= 1.111. = 46
    13 :  a= ..1111 = 15    w= .111.1 = 29
    14 :  a= .1111. = 30    w= 111.1. = 58
    15 :  a= .1.111 = 23    w= 11.1.1 = 53
    16 :  a= ..1..1 =  9    w= 1.1.11 = 43
    17 :  a= .1..1. = 18    w= .1.11. = 22
    18 :  a= 1..1.. = 36    w= 1.11.. = 44
    19 :  a= 1...11 = 35    w= .11..1 = 25
    20 :  a= 1....1 = 33    w= 11..11 = 51
    21 :  a= .111.1 = 29    w= 1..111 = 39
    22 :  a= .1.1.1 = 21    w= ..1111 = 15
    23 :  a= ...1.1 =  5    w= .11111 = 31
    24 :  a= ..1.1. = 10    w= 11111. = 62
    25 :  a= .1.1.. = 20    w= 1111.. = 60
    26 :  a= ....11 =  3    w= 111..1 = 57
    27 :  a= ...11. =  6    w= 11..1. = 50
    28 :  a= ..11.. = 12    w= 1..1.. = 36
    29 :  a= .11... = 24    w= ..1... =  8
    30 :  a= ..1.11 = 11    w= .1...1 = 17
    31 :  a= .1.11. = 22    w= 1...1. = 34
    32 :  a= ...111 =  7    w= ...1.1 =  5
    33 :  a= ..111. = 14    w= ..1.1. = 10
    34 :  a= .111.. = 28    w= .1.1.. = 20
    35 :  a= .1..11 = 19    w= 1.1..1 = 41
    36 :  a= .....1 =  1    w= .1..11 = 19   <--= period restarts
    37 :  a= ....1. =  2    w= 1..11. = 38
    38 :  a= ...1.. =  4    w= ..11.. = 12

Figure 41.7-A: Successive states of an FCSR with modulus c = 37.

            if ( a_>c_ )  a_ -= c_;  // reduce mod c

            // update w:
            w_ <<= 1;
            w_ |= (a_ & 1);
            w_ &= mask_;

            return w_;
        }

        [--snip--]

        void set_a(ulong a)
        {
            w_ = 0;
            ulong t = c_;
            while ( (t>>=1) )
            {
                if ( 0==(a & 1) )  a >>= 1;
                else
                {
                    a = (a & c_) + ((a ^ c_) >> 1);
                }
            }
            a_ = a;
            next_w();
        }

        ulong get_a() const  { return a_; }
        ulong get_w() const  { return w_; }
    };


The routine corresponds to the Galois setup described (for the LFSR) in section 41.2 on page 867, see
also [162]. Figure 41.7-A shows the successive states of an FCSR with modulus c = 37. It was created
with the program [FXT: gf2n/fcsr-demo.cc]. Note that w does not run through all values < c but through
a subset of c - 1 distinct values $< 2^n$ (with n the number of bits of c, here n = 6).
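The states of figure 41.7-A can be reproduced with a reduced sketch. The struct mini_fcsr is hypothetical and differs from the class above in one respect that should be kept in mind: instead of stepping the state backwards as set_a() does, it fills the word w by running one full period forward, which assumes that c is odd.

```cpp
#include <set>

typedef unsigned long ulong;

// Hypothetical reduced FCSR; c must be odd (in the text: a prime with
// primitive root 2).
struct mini_fcsr
{
    ulong a_;     // internal state, a power of 2 modulo c
    ulong w_;     // word of the shift register sequence
    ulong c_;     // modulus
    ulong mask_;  // as many ones as c has bits

    explicit mini_fcsr(ulong c) : a_(1), w_(0), c_(c), mask_(0)
    {
        for (ulong t=c; t!=0; t>>=1)  mask_ = (mask_ << 1) | 1;
        // warm-up (assumption of this sketch): step until a_ returns to 1,
        // afterwards w_ holds the lowest bits of the preceding states
        do { next(); } while ( a_ != 1 );
    }

    ulong next()
    {
        a_ <<= 1;                 // a *= 2
        if ( a_ > c_ )  a_ -= c_;  // reduce mod c
        w_ <<= 1;
        w_ |= (a_ & 1);
        w_ &= mask_;
        return w_;
    }
};

// Number of distinct words w during one period.
inline ulong fcsr_distinct_words(ulong c)
{
    mini_fcsr f(c);
    std::set<ulong> seen;
    do { seen.insert(f.w_);  f.next(); } while ( f.a_ != 1 );
    return static_cast<ulong>(seen.size());
}
```

For c = 37 the word after the warm-up is 19, the next word is 38, and 36 distinct words occur during one period, in agreement with the figure.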


     x:  p prime with 2 a primitive root, 2**x < p < 2**(x+1)
     1:  3
     2:  5
     3:  11 13
     4:  19 29
     5:  37 53 59 61
     6:  67 83 101 107
     7:  131 139 149 163 173 179 181 197 211 227
     8:  269 293 317 347 349 373 379 389 419 421 443 461 467 491 509
     9:  523 541 547 557 563 587 613 619 653 659 661 677 701 709 757
         773 787 797 821 827 829 853 859 877 883 907 941 947 1019
    10:  1061 1091 1109 1117 1123 1171 1187 1213 1229 1237 1259 1277
         1283 1291 1301 1307 1373 1381 1427 1451 1453 1483 1493 1499
         1523 1531 1549 1571 1619 1621 1637 1667 1669 1693 1733 1741
         1747 1787 1861 1867 1877 1901 1907 1931 1949 1973 1979 1987
         1997 2027 2029

Figure 41.7-B: List of primes p < 2048 where 2 is a primitive root.


A list of all primes less than 2048 for which 2 is a primitive root is shown in figure 41.7-B. The shown
sequence is entry A001122 in [312]. For further information on the correspondence between LFSR and
FCSR see [208].
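The rows of the table can be recomputed with a direct search. The sketch below determines the order of 2 by repeated doubling, which is sufficient for small primes; the function names are hypothetical.

```cpp
#include <vector>

inline bool is_prime(unsigned p)
{
    if ( p < 2 )  return false;
    for (unsigned d=2; d*d<=p; ++d)  if ( p % d == 0 )  return false;
    return true;
}

// Multiplicative order of 2 modulo p, for odd p >= 3.
inline unsigned order_of_two(unsigned p)
{
    unsigned a = 2,  e = 1;
    while ( a != 1 )  { a = (2*a) % p;  ++e; }
    return e;
}

// Primes p with 2**x < p < 2**(x+1) for which 2 is a primitive root
// (1 <= x <= 15 keeps all intermediate values small).
inline std::vector<unsigned> primes_with_primitive_root_2(unsigned x)
{
    std::vector<unsigned> v;
    for (unsigned p=(1u<<x)+1; p<(2u<<x); p+=2)
        if ( is_prime(p) && order_of_two(p)==p-1 )  v.push_back(p);
    return v;
}
```

For x = 5 the result is 37, 53, 59, 61.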


41.8 Linear hybrid cellular automata (LHCA) 


Linear hybrid cellular automata (LHCA) are 1-dimensional cellular automata (with 0 and 1 the only
possible states for each cell) where two different rules are applied dependent on the position, therefore
the 'hybrid' in the name. The computation of the next state with an LHCA can be implemented as
follows [FXT: bpol/lhca.h]:


    inline ulong lhca_next(ulong x, ulong r, ulong m)
    // LHCA := (1-dim) Linear Hybrid Cellular Automaton.
    // Return next state (after x) of the LHCA with
    // rule (defined by) r:
    //   Rule 150 is applied for cells where r is one, rule 90 else.
    // Rule 150 := next(x) = x + leftbit(x) + rightbit(x)
    // Rule 90  := next(x) = leftbit(x) + rightbit(x)
    // Length defined by m:
    //   m has to be a burst of the n lowest bits (n: length of automaton)
    {
        r &= x;
        ulong t = (x>>1) ^ (x<<1);
        t ^= r;
        t &= m;
        return t;
    }

Note that the routine is branch free and implementation in hardware is trivial. 
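A short sketch shows how the routine is used to determine the period of an automaton. The helper lhca_period() is hypothetical; it starts at the state 1 and counts the steps until that state recurs.

```cpp
typedef unsigned long ulong;

inline ulong lhca_next(ulong x, ulong r, ulong m)
{
    r &= x;
    ulong t = (x >> 1) ^ (x << 1);
    t ^= r;
    t &= m;
    return t;
}

// Number of steps until the state 1 recurs for the length-n LHCA with rule r.
// Gives up after 2^n steps (then the returned value exceeds 2^n - 1).
inline ulong lhca_period(ulong r, ulong n)
{
    const ulong m = (1UL << n) - 1;
    ulong x = 1,  ct = 0;
    do  { x = lhca_next(x, r, m);  ++ct; }
    while ( (x != 1) && (ct <= m) );
    return ct;
}
```

For n = 5 both the rule r = ....1 and the rule r = 1111. give the maximal period 31.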


The naming convention for the rules is as follows: draw a table of the eight possible states of a cell
together with its neighbors, then draw the new states below:

    XXX  XX0  X0X  X00  0XX  0X0  00X  000
     0    X    0    X    X    0    X    0

Now read the lower row as a binary number, the result equals $01011010_2 = 90$, so this is rule 90. Rule 150
corresponds to $10010110_2 = 150$:

    XXX  XX0  X0X  X00  0XX  0X0  00X  000
     X    0    0    X    0    X    X    0


A run of successive values for the length-16 weight-2 rule vector $r = 4001_{16}$ starting with 1 is shown on
the left side of figure 41.8-A. For certain rule vectors r all $m = 2^n - 1$ nonzero values occur, the period
is maximal. This is demonstrated in [FXT: gf2n/lhca-demo.cc], which for n = 5 and rule r = 1 gives
the output shown in the middle of figure 41.8-A. Rule vectors with minimal weight that lead to maximal
period are given in [94]. The list in [FXT: bpol/lhcarule-minweight.cc] is taken from that source:

    #define R1(n,s1)  (1UL<<s1)
    #define R2(n,s1,s2)  (1UL<<s1) | (1UL<<s2)



[Left panel: run of the 16-bit LHCA with rule $r = 4001_{16}$, steps 1 to 33 and 65531 to 65535, after which
the run restarts with steps 1, 2, 3.]

            rule = ....1          rule = 1111.
      1:    ....1           1:    ....1
      2:    ...11           2:    ...1.
      3:    ..11.           3:    ..111
      4:    .1111           4:    .1.11
      5:    11...           5:    11..1
      6:    111..           6:    ..11.
      7:    1.11.           7:    .1..1
      8:    ..111           8:    1111.
      9:    .11..           9:    .11.1
     10:    1111.          10:    1....
     11:    1..11          11:    11...
     12:    .111.          12:    ..1..
     13:    11.11          13:    .111.
     14:    11.1.          14:    1.1.1
     15:    11..1          15:    1.1..
     16:    11111          16:    1.11.
     17:    1....          17:    1...1
     18:    .1...          18:    11.1.
     19:    1.1..          19:    ...11
     20:    ...1.          20:    ..1.1
     21:    ..1.1          21:    .11..
     22:    .1..1          22:    1..1.
     23:    1.111          23:    11111
     24:    ..1..          24:    .1111
     25:    .1.1.          25:    1.111
     26:    1...1          26:    1..11
     27:    .1.11          27:    111.1
     28:    1..1.          28:    .1...
     29:    .11.1          29:    111..
     30:    111.1          30:    .1.1.
     31:    1.1.1          31:    11.11

Figure 41.8-A: Partial run of a 16-bit LHCA (left) and complete runs of two 5-bit LHCAs (middle and
right). The LHCAs shown have maximal periods.


    extern const ulong minweight_lhca_rule[] =
    // LHCA rules of minimum weight that lead to maximal period.
    {
        0,  // (empty)
        R1( 1, 0),
        R1( 2, 0),
        R1( 3, 0),
        R2( 4, 0, 2),
        R1( 5, 0),
        R1( 6, 0),
        R1( 7, 2),
        R2( 8, 1, 2),
        R1( 9, 0),
        R2( 10, 1, 6),
        R1( 11, 0),
        R2( 12, 2, 6),
        R1( 13, 4),
        [--snip--]

Up to n = 500 there is always a rule with weight at most 2 that leads to the maximal period. The full
list of these rules is given in [FXT: data/minweight-lhca-rules.txt].

41.8.1 Conversion of LHCAs to binary polynomials 


To convert a length-n LHCA to a binary polynomial proceed as follows: initialize $p_{-1} := 0$, $p_0 := 1$, and
iterate for k = 1, 2, ..., n:

$$ p_k := (x + r_{k-1}) \, p_{k-1} + p_{k-2} \tag{41.8-1} $$

where $r_i$ denotes bit i of rule r. The degree of the returned polynomial $p_n$ is n. An implementation of
the algorithm is [FXT: bpol/lhca.h]:

    inline ulong lhca2poly(ulong r, ulong n)
    // Return binary polynomial p that corresponds to the length-n LHCA rule r.
    {
[Table: for n = 1, ..., 31 the LHCA rule r and the corresponding polynomial c as bit patterns, followed by
the positions of the set bits of r. The set bits are:]

     1:  r = [0]           12:  r = [2, 6]         23:  r = [0]
     2:  r = [0]           13:  r = [4]            24:  r = [7, 11]
     3:  r = [0]           14:  r = [0]            25:  r = [8]
     4:  r = [0, 2]        15:  r = [2]            26:  r = [0]
     5:  r = [0]           16:  r = [0, 14]        27:  r = [0, 19]
     6:  r = [0]           17:  r = [4]            28:  r = [2]
     7:  r = [2]           18:  r = [0, 16]        29:  r = [0]
     8:  r = [1, 2]        19:  r = [2]            30:  r = [0]
     9:  r = [0]           20:  r = [1, 2]         31:  r = [10]
    10:  r = [1, 6]        21:  r = [0, 9]
    11:  r = [0]           22:  r = [4]

Figure 41.8-B: Minimum-weight LHCA rules and the corresponding binary polynomials.


        ulong p2 = 0,  p1 = 1;
        while ( n-- )
        {
            ulong m = r & 1;
            r >>= 1;
            ulong p = (p1<<1) ^ p2;
            if ( m )  p ^= p1;
            p2 = p1;  p1 = p;
        }
        return p1;
    }


The lexicographically first minimum-weight LHCA rules and their binary polynomials are shown in
figure 41.8-B. The table was created by the program [FXT: gf2n/lhca2poly-demo.cc]. For rules of maximal
period the polynomials are primitive.

An LHCA rule and its reverse give the identical polynomial. The polynomials corresponding to an LHCA
rule and its complement are either both reducible or both irreducible: if p(x) corresponds to the LHCA rule
r, then p(x+1) corresponds to the rule that is the complement of r.
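Both symmetries can be checked exhaustively for small lengths. In the following sketch lhca2poly() restates the conversion routine; the helpers revbin_naive(), bitpol_x_plus_one() and lhca_symmetries_hold() are hypothetical names introduced for this check.

```cpp
typedef unsigned long ulong;

inline ulong lhca2poly(ulong r, ulong n)
{
    ulong p2 = 0,  p1 = 1;
    while ( n-- )
    {
        const ulong m = r & 1;
        r >>= 1;
        ulong p = (p1 << 1) ^ p2;
        if ( m )  p ^= p1;
        p2 = p1;  p1 = p;
    }
    return p1;
}

// Reverse the n lowest bits of x.
inline ulong revbin_naive(ulong x, ulong n)
{
    ulong r = 0;
    for (ulong k=0; k<n; ++k)  { r = (r << 1) | (x & 1);  x >>= 1; }
    return r;
}

// Return the binary polynomial p(x+1), computed with the Horner scheme.
inline ulong bitpol_x_plus_one(ulong p)
{
    ulong q = 0;
    for (ulong b = ~(~0UL >> 1); b != 0; b >>= 1)  // highest bit first
    {
        q ^= (q << 1);         // q *= (x+1)
        if ( p & b )  q ^= 1;  // add coefficient
    }
    return q;
}

// Check both statements for all rules of length n.
inline bool lhca_symmetries_hold(ulong n)
{
    const ulong m = (1UL << n) - 1;
    for (ulong r=0; r<=m; ++r)
    {
        const ulong p = lhca2poly(r, n);
        if ( p != lhca2poly(revbin_naive(r, n), n) )  return false;
        if ( bitpol_x_plus_one(p) != lhca2poly(r ^ m, n) )  return false;
    }
    return true;
}
```

As a spot check of the conversion, the rule r = 1 with n = 5 gives the polynomial 11.111 (decimal 55), and r = [0, 2] with n = 4 gives 1..11 (decimal 19).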


Figure 41.8-C shows a list of LHCA rules with maximal period where the highest bit lies in the lowest
possible position. The list can be produced with [FXT: gf2n/lowbit-lhca-demo.cc]. For software
implementation of LHCAs that need more than one machine word low-bit rules are advantageous. A table of
low-bit LHCA rules corresponding to primitive polynomials up to n = 400 is given in
[FXT: data/lowbit-lhca-rules.txt]. The maximal value of a rule that occurs is r = 1293 for n = 380 so the table can be
stored compactly, for example, in an array of 16-bit integers.

Figure 41.8-D shows a list of LHCA rules that have minimal weight and the highest bit in the lowest
possible position. The list can be produced with [FXT: gf2n/minweight-lowbit-lhca-demo.cc].
[Table: for n = 1, ..., 31 the LHCA rule r and the corresponding polynomial c as bit patterns, followed by
the positions of the set bits of r. The set bits are:]

     1:  r = [0]               12:  r = [1, 2, 4]      23:  r = [0]
     2:  r = [0]               13:  r = [0, 3]         24:  r = [1, 2, 3, 5]
     3:  r = [0]               14:  r = [0]            25:  r = [0, 1, 3]
     4:  r = [0, 2]            15:  r = [2]            26:  r = [0]
     5:  r = [0]               16:  r = [0, 2, 4]      27:  r = [2, 3, 4]
     6:  r = [0]               17:  r = [0, 1]         28:  r = [2]
     7:  r = [2]               18:  r = [1, 2, 4]      29:  r = [0]
     8:  r = [1, 2]            19:  r = [2]            30:  r = [0]
     9:  r = [0]               20:  r = [1, 2]         31:  r = [1, 3, 4]
    10:  r = [0, 1, 2, 3]      21:  r = [1, 4, 5]
    11:  r = [0]               22:  r = [0, 1, 3]

Figure 41.8-C: Low-bit LHCA rules and the corresponding binary polynomials.
[Table: for n = 1, ..., 31 the LHCA rule r and the corresponding polynomial c as bit patterns, followed by
the positions of the set bits of r. The set bits are:]

     1:  r = [0]           12:  r = [2, 6]         23:  r = [0]
     2:  r = [0]           13:  r = [4]            24:  r = [7, 11]
     3:  r = [0]           14:  r = [0]            25:  r = [8]
     4:  r = [0, 2]        15:  r = [2]            26:  r = [0]
     5:  r = [0]           16:  r = [4, 6]         27:  r = [3, 6]
     6:  r = [0]           17:  r = [4]            28:  r = [2]
     7:  r = [2]           18:  r = [1, 5]         29:  r = [0]
     8:  r = [1, 2]        19:  r = [2]            30:  r = [0]
     9:  r = [0]           20:  r = [1, 2]         31:  r = [10]
    10:  r = [1, 6]        21:  r = [5, 6]
    11:  r = [0]           22:  r = [4]

Figure 41.8-D: Minimum-weight low-bit LHCA rules and the corresponding binary polynomials.
41.8.2 Conversion of irreducible binary polynomials to LHCAs


Computing the LHCA corresponding to an irreducible binary polynomial $p(x)$ proceeds in two steps.
Firstly, the following quadratic equation over GF($2^n$) must be solved for $Z$:

$$ Z^2 + (x^2 + x)\, p'(x)\, Z + 1 = 0 \bmod p(x) $$   (41.8-2)

The algorithm is given in section 42.4 on page 896. The second step is a GCD computation. Set $Z_{n-1} := p$,
$Z_{n-2} := Z$, and compute successively $Z_{n-3}, Z_{n-4}, \ldots, Z_0$ such that

$$ Z_{n-1} = (x + r_0)\, Z_{n-2} + Z_{n-3} $$   (41.8-3a)
$$ Z_{n-2} = (x + r_1)\, Z_{n-3} + Z_{n-4} $$   (41.8-3b)
$$ \vdots $$
$$ Z_2 = (x + r_{n-3})\, Z_1 + Z_0 $$   (41.8-3c)
$$ Z_1 = (x + r_{n-2})\, Z_0 + 1 $$   (41.8-3d)
$$ Z_0 = (x + r_{n-1})\, 1 + 0 $$   (41.8-3e)

Each step consists of a computation of polynomial quotient and remainder (see section 40.1 on page 822).
The vector $[r_{n-1}, r_{n-2}, \ldots, r_0]$ is the LHCA rule. An implementation of the method can be given as
follows [FXT: bpol/bitpol2lhca.cc]:

    ulong
    poly2lhca(ulong p)
    // Return LHCA rule corresponding to the binary polynomial P.
    // Must have: P irreducible.
    {
        ulong dp = bitpol_deriv(p);
        const ulong h = bitpol_h(p);
        ulong b = dp;
        b ^= bitpolmod_times_x(b, p, h);  // p' * (x+1)
        b = bitpolmod_times_x(b, p, h);   // p' * (x^2+x)

        ulong r0, r1;  // solutions of 1 + (p'*(x*x+x))*z + z*z == 0 modulo p
        bool q = bitpolmod_solve_quadratic(1, b, 1, r0, r1, p);
        if ( 0==q )  return 0;

        // GCD steps:
        ulong r = 0;  // rule vector
        ulong x = p,  y = r0;  // same result with r1
        while ( y )
        {
            ulong tq, tr;
            bitpol_divrem(x, y, tq, tr);
            r <<= 1;
            r |= (tq & 1);
            x = y;
            y = tr;
        }

        // r = revbin(r, bitpol_deg(p));
        return r;
    }


The described algorithm is given in the very readable paper [93], which is recommended for further studies.
Note that the paper uses the reversed rule. To use the reversed rule, uncomment the line just before the
final return statement. The program [FXT: gf2n/poly2lhca-demo.cc] converts a given polynomial into
the corresponding LHCA rule.


41.9 Additive linear hybrid cellular automata 
The algorithm for the conversion of LHCA rules to binary polynomials is a special case of a general 
method for additive cellular automata. An automaton is called additive if, for all words a and b, 


$$ N(a) + N(b) = N(a+b) $$   (41.9-1)

where $N(x)$ is the next state after the state $x$ and addition is bit-wise (XOR).


41.9.1 Conversion into binary polynomials 


For additive automata the action of $N$ on a binary word can be described by a matrix over GF(2): Let
$e_k$ be the word where only bit $k$ is set. We seek the matrix whose $k$-th row is $N(e_k)$. We use a cyclic
LHCA (CLHCA) as an example. A CLHCA is a LHCA with cyclic boundary condition. Its action $N$
can be implemented as [FXT: bpol/clhca.h]:

    ulong clhca_next(ulong x, ulong r, ulong n)
    {
        r &= x;
        ulong t = x ^ bit_rotate_right(x, 1, n);
        t ^= r;
        return t;
    }
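The matrix whose $k$-th row is $N(e_k)$ can be built directly from the update function. The following is a minimal stand-alone sketch, not the FXT code: the names `clhca_matrix`, `apply_matrix`, and `matrix_matches_update` are hypothetical, and the rotation is written out by hand instead of calling `bit_rotate_right()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Next state of a length-n cyclic LHCA with rule r (stand-alone variant).
inline u64 clhca_next(u64 x, u64 r, unsigned n)
{
    assert(n >= 1 && n <= 63);
    const u64 mask = (u64(1) << n) - 1;
    x &= mask;
    const u64 rot = ((x >> 1) | (x << (n - 1))) & mask;  // rotate right by one
    return x ^ rot ^ (r & x);
}

// Rows of the matrix: row k is N(e_k), the image of the word with only bit k set.
inline std::vector<u64> clhca_matrix(u64 r, unsigned n)
{
    std::vector<u64> M(n);
    for (unsigned k = 0; k < n; ++k)  M[k] = clhca_next(u64(1) << k, r, n);
    return M;
}

// Apply the matrix to x: XOR of the rows selected by the set bits of x.
inline u64 apply_matrix(const std::vector<u64> &M, u64 x)
{
    u64 y = 0;
    for (std::size_t k = 0; k < M.size(); ++k)
        if ((x >> k) & 1)  y ^= M[k];
    return y;
}

// True if the matrix reproduces N for every length-n word (exhaustive check).
inline bool matrix_matches_update(u64 r, unsigned n)
{
    assert(n <= 16);  // keep the exhaustive loop small
    const std::vector<u64> M = clhca_matrix(r, n);
    for (u64 x = 0; x < (u64(1) << n); ++x)
        if (apply_matrix(M, x) != clhca_next(x, r, n))  return false;
    return true;
}
```

That the XOR of the selected rows agrees with the update for every word is exactly the additivity of relation 41.9-1.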
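The action of a CLHCA with rule $r := [r_0, r_1, \ldots, r_{n-1}]$, as the matrix, is

$$ M_r := \begin{bmatrix}
\bar{r}_0 & & & & & 1 \\
1 & \bar{r}_1 & & & & \\
 & 1 & \bar{r}_2 & & & \\
 & & \ddots & \ddots & & \\
 & & & 1 & \bar{r}_{n-2} & \\
 & & & & 1 & \bar{r}_{n-1}
\end{bmatrix} $$   (41.9-2)

where $\bar{r}_i$ is the complement of $r_i$ and blank entries denote zero. The binary polynomial corresponding
to the automaton is the characteristic polynomial of the matrix $M_r$: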


    inline ulong clhca2poly(ulong r, ulong n)
    // Compute the binary polynomial corresponding to the length-n CLHCA with rule r.
    {
        ALLOCA(ulong, M, n);
        for (ulong k=0; k<n; ++k)  M[k] = clhca_next( 1UL<<k, r, n );
        ulong c = bitmat_charpoly(M, n);
        return c;
    }


To compute the polynomial for any additive automaton, replace clhca_next() by the update for the
automaton. The routine bitmat_charpoly() is given in [FXT: bmat/bitmat-charpoly.cc], it uses an
algorithm from p.55].


41.9.2 Properties of the CLHCA 


[Figure 41.9-A: successive states 1 to 15 for the rules r = .111, r = 1.11, r = 11.1, and r = 111.; state columns not recoverable.]


Figure 41.9-A: Rules of identical weight lead to essentially identical CLHCA: all tracks in the successive 
states of all weight-3 (length-4) automata are cyclic shifts of each other. 
[Figure 41.9-B: table with columns k, r = CLHCA rule, c = binary polynomial, for k = 0, ..., 17; entries not recoverable.]


Figure 41.9-B: Binary polynomials (right) corresponding to rule vectors for the length-17 cyclic linear 
hybrid cellular automata (left) with k bits set. Automata with maximal period are marked with ‘P’. 


[Figure 41.9-C: state listings for w=2 (r = ...11, states 1 to 31), w=3 (r = ..111, states 1 to 31), and w=1 (r = ....1, states 1 to 21); state columns not recoverable.]


Figure 41.9-C: Successive states of the length-5 CLHCA with weights 2, 3, and 1 (left to right). 


The polynomials for the given automata depend only on the number of bits set in the rule $r$. Figure 41.9-A
shows the successive states for all length-4 CLHCA with rules of weight 3. As there are essentially
only $n+1$ different automata of length $n$, we only need to investigate the rules with the lowest $k$ bits set
for $k = 0, 1, \ldots, n$. The polynomials for the length-17 automata are shown in figure 41.9-B. If we write
$C_{n,k}(x)$ for the polynomial for the length-$n$ rule with $k$ set bits, then $C_{n,k}(x) = C_{n,n-k}(x+1)$. Indeed,
they can be given in the closed form

$$ C_{n,k}(x) = 1 + x^k\, (1+x)^{n-k} $$   (41.9-3)


Thus the polynomials can be computed with the simple routine [FXT: bpol/clhca.h]:

    inline ulong clhca2poly(ulong r, ulong n)
    {
        ulong c = 1UL << n;
        for (ulong k=0; k<n; ++k)  if ( 0==(r & (1UL<<k)) )  c ^= (c>>1);
        c ^= 1;
        return c;
    }

[Figure 41.9-D: table with columns n, w, and c; entries not recoverable.]

Figure 41.9-D: The polynomials (c, right) corresponding to rules of lowest weight (w) such that the
length-n ($n \le 30$) CLHCA has maximal period.
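The closed form 41.9-3 and the symmetry $C_{n,k}(x) = C_{n,n-k}(x+1)$ can be checked mechanically. The sketch below is an illustration only: `clhca_poly`, `subst_x_plus_1`, and `symmetry_holds` are hypothetical names, and polynomials are stored as binary words with bit $j$ holding the coefficient of $x^j$.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Closed form C_{n,k}(x) = 1 + x^k (1+x)^(n-k) as a binary word.
inline u64 clhca_poly(unsigned n, unsigned k)
{
    assert(k <= n && n <= 62);
    u64 c = u64(1) << k;                                  // x^k
    for (unsigned j = 0; j < n - k; ++j)  c ^= (c << 1);  // multiply by (1+x)
    return c ^ 1;                                         // add 1
}

// Substitute x+1 for x in the binary polynomial c (Horner scheme over GF(2)).
inline u64 subst_x_plus_1(u64 c)
{
    u64 res = 0;
    for (int j = 63; j >= 0; --j)
    {
        res ^= (res << 1);    // res *= (x+1)
        res ^= (c >> j) & 1;  // add the coefficient of x^j
    }
    return res;
}

// Check C_{n,k}(x) == C_{n,n-k}(x+1) for k = 0, ..., n.
inline bool symmetry_holds(unsigned n)
{
    for (unsigned k = 0; k <= n; ++k)
        if (clhca_poly(n, k) != subst_x_plus_1(clhca_poly(n, n - k)))  return false;
    return true;
}
```

For example, `clhca_poly(5, 2)` gives the word 111101, that is, $x^5+x^4+x^3+x^2+1$.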


With $n = 5$ there are just two rules that lead to maximal periods, $r = [0,0,0,1,1]$ (weight 2) and
$r = [0,0,1,1,1]$ (weight 3). The successive states for both rules are shown in figure 41.9-C. The polynomials
corresponding to the rules of minimal weight for all length-$n$ automata where $n \le 30$ are given in
figure 41.9-D. The sequence of values $n$ where a primitive length-$n$ CLHCA exists starts as:

    2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 17, 18, 20, 21, 22, 23, 25, 28, 29, 31, 33,
    35, 36, 39, 41, 47, 49, 52, 55, 57, 58, 60, 63, ...

It coincides with entry A073726 in [312], values $n$ such that there is a primitive trinomial of degree $n$.

The sequence was computed with the program [FXT: gf2n/clhca-demo.cc]. A list of CLHCA rules with
maximal period is given in [FXT: data/clhca-rules.txt].
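Whether a rule leads to maximal period can also be tested by brute force for small $n$. The following sketch simply iterates the automaton starting from the state 1; it is not the method of the demo program, and `clhca_period` and `has_maximal_period` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Next state of a length-n cyclic LHCA with rule r (stand-alone variant).
inline u64 clhca_next(u64 x, u64 r, unsigned n)
{
    assert(n >= 1 && n <= 63);
    const u64 mask = (u64(1) << n) - 1;
    x &= mask;
    const u64 rot = ((x >> 1) | (x << (n - 1))) & mask;  // rotate right by one
    return x ^ rot ^ (r & x);
}

// Length of the cycle through the state 1,
// or 0 if the state 1 is not reached again within 2^n steps.
inline u64 clhca_period(u64 r, unsigned n)
{
    assert(n >= 1 && n <= 24);  // brute force, keep n small
    const u64 limit = u64(1) << n;
    u64 x = 1;
    for (u64 len = 1; len <= limit; ++len)
    {
        x = clhca_next(x, r, n);
        if (x == 1)  return len;
    }
    return 0;
}

// Maximal period: the cycle through 1 visits all 2^n - 1 nonzero states.
inline bool has_maximal_period(u64 r, unsigned n)
{
    return clhca_period(r, n) == (u64(1) << n) - 1;
}
```

With $n = 5$ the rules of weight 2 and 3 give period 31, while the rule of weight 1 does not reach the maximal period.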
Chapter 42


Binary finite fields: GF($2^n$)


We introduce the binary finite fields GF($2^n$). The polynomial representation is used for stating some
basic properties. The underlying arithmetical algorithms are given in chapter 40. An introduction to
the representation by normal bases follows. Certain normal bases are advantageous for hardware
implementations of the arithmetical algorithms. Finally, several ways of computing the number of irreducible
binary normal polynomials are given.


Binary finite fields are important for applications like error correcting codes and cryptography. 


42.1 Arithmetic and basic properties 


In an earlier section we discussed the finite fields $\mathbb{Z}/p\mathbb{Z}$ = GF($p$) for $p$ a prime. The 'GF' stands
for Galois Field, another symbol often used is $\mathbb{F}_p$. The arithmetic in GF($p$) is the arithmetic modulo $p$.


There are more finite fields: for every prime $p$ there are fields with $Q = p^n$ elements for all $n \ge 1$.
All elements in a finite field GF($p^n$) can be represented as polynomials modulo a degree-$n$ irreducible
polynomial $C$ with coefficients over the field GF($p$). The arithmetic to be used is polynomial arithmetic
modulo $C$. As in general there is more than one irreducible polynomial of degree $n$ it might seem that
there is more than one field GF($Q$) for given $Q = p^n$. There isn't. Using different polynomials as modulus
leads to isomorphic representations of the same field. The field GF($p^n$) is called an extension field of
GF($p$). The field GF($p$) is called the ground field (or base field) of GF($p^n$).


When speaking about an element of GF($Q$), think of a polynomial modulo some fixed irreducible
polynomial $C$ (modulus). For example, the product of two elements can be computed as the polynomial product
modulo $C$. For the equivalent construction using the polynomial $x^2 + 1$ with real coefficients that leads
to the complex numbers see section 39.12 on page 804.


The polynomial C used as modulus is called the field polynomial. 
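As an illustration of the product modulo the field polynomial, here is a minimal shift-and-add sketch for GF($2^n$). It is not the FXT routine `bitpolmod_mult()`: the name `gf2n_mult` and its parameter list are assumptions made for this example. Elements and the modulus are binary words, bit $j$ is the coefficient of $x^j$.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Product of a and b in GF(2^n) with field polynomial c, where deg(c) == n.
inline u64 gf2n_mult(u64 a, u64 b, u64 c, unsigned n)
{
    assert(n >= 1 && n <= 63);
    const u64 top = u64(1) << n;
    assert((c >> n) == 1);       // c has degree n
    assert(a < top && b < top);  // operands are reduced
    u64 p = 0;
    while (b)
    {
        if (b & 1)  p ^= a;      // add a * x^j for each set bit j of b
        b >>= 1;
        a <<= 1;                 // a *= x
        if (a & top)  a ^= c;    // reduce modulo c
    }
    return p;
}
```

With $C = x^4 + x + 1$ (the word 0x13) the product of $x$ and $x^3$ is $x^4 = x + 1$, so `gf2n_mult(2, 8, 0x13, 4)` gives 3.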


The elements zero, the neutral element of addition, and one, the neutral element of multiplication, are
the constant polynomials with constant zero and one, respectively. This does not depend on the choice
of the modulus.


Multiplication with an element of the ground field is called scalar multiplication. In this section an
element of the ground field is denoted by $u$. Scalar multiplication corresponds to the multiplication of
every coefficient of (the polynomial representing) the field element $a$ by $u$.


We restrict our attention mostly to $Q = 2^n$, that is, the binary finite field GF($2^n$), as we have seen the
algorithms for the underlying arithmetic in chapter 40.


42.1.1 Characteristic and linear functions


The characteristic of the field GF($p^n$) is $p$: we identify two elements if their difference is a multiple of $p$
(when adding any element of the field $p$ times to 0 the result will be 0). For infinite fields such as $\mathbb{C}$ the
characteristic is 0: two elements are identified if their difference is a multiple of 0 (that is, the underlying
equivalence relation is equality, see section 3.5.2.1 on page 149). Note that the notion "multiplication is
repeated addition" is meaningless in an extension field GF($p^n$).


For GF($p^n$) we have

$$ (u+v)^p = u^p + v^p $$   (42.1-1)

because the binomial coefficients $\binom{p}{k}$ are divisible by $p$ for $k = 1, 2, \ldots, p-1$. For GF($2^n$):

$$ (u+v)^2 = u^2 + v^2 $$   (42.1-2)

We call a function $f$ linear if the relation

$$ f(u_1 \cdot a + u_2 \cdot b) = u_1 \cdot f(a) + u_2 \cdot f(b) $$   (42.1-3)

holds for $u_1$ and $u_2$ from the ground field. The linear functions in GF($p^n$) are of the form

$$ f(x) = \sum_{k=0}^{n-1} u_k\, x^{p^k} $$   (42.1-4)

where the $u_k$ are again in the ground field. Linear functions can be computed using lookup tables. In
GF($2^n$) these are all functions of the form

$$ f(x) = \sum_{k=0}^{n-1} u_k\, x^{2^k} $$   (42.1-5)


42.1.2 Squaring 


Squaring (and raising to any power $2^k$) is a linear operation in GF($2^n$). The linearity can be used to
accelerate the computation of squares. Write

$$ (u_0 + u_1 x + u_2 x^2 + \ldots + u_{n-1} x^{n-1})^2 = u_0 + u_1 x^2 + u_2 x^4 + \ldots + u_{n-1} x^{2(n-1)} $$   (42.1-6a)
$$ =: u_0\, s_0 + u_1\, s_1 + u_2\, s_2 + \ldots + u_{n-1}\, s_{n-1} $$   (42.1-6b)

One has to precompute the values $s_i = x^{2i} \bmod C$ for $i = 0, 1, 2, \ldots, n-1$. For successive square
computations it is only necessary to add (that is, XOR) those $s_i$ corresponding to nonzero $u_i$. For
example, with $n = 13$ and the polynomial modulus $C = x^{13} + x^4 + x^3 + x^1 + 1$ one obtains the table

    s_0  = ............1
    s_1  = ..........1..
    s_2  = ........1....
    s_3  = ......1......
    s_4  = ....1........
    s_5  = ..1..........
    s_6  = 1............
    s_7  = .......11.11.
    s_8  = .....11.11...
    s_9  = ...11.11.....
    s_10 = .11.11.......
    s_11 = 1.11....11.11
    s_12 = 11....1.11.1.

The squares $s_0, s_1, \ldots, s_{\lfloor (n-1)/2 \rfloor}$ are simply $s_i = x^{2i}$, which can be used to further accelerate the
computation.
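The table method for squaring can be sketched as follows. This is an illustration under assumed names (`squaring_table`, `square_by_table`, `square_by_mult`, `tables_agree` are hypothetical), not the `sqr_tab` code of the GF2n class. The reference routine squares by plain shift-and-add multiplication so that the two results can be compared.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Table s[i] = x^(2i) mod c for i = 0..n-1, where deg(c) == n.
inline std::vector<u64> squaring_table(u64 c, unsigned n)
{
    assert(n >= 1 && n <= 62 && (c >> n) == 1);
    const u64 top = u64(1) << n;
    std::vector<u64> s(n);
    u64 v = 1;
    for (unsigned i = 0; i < n; ++i)
    {
        s[i] = v;
        for (int j = 0; j < 2; ++j)  // v *= x^2, reducing after each shift
        {
            v <<= 1;
            if (v & top)  v ^= c;
        }
    }
    return s;
}

// Square of a: XOR of the table entries selected by the set bits of a.
inline u64 square_by_table(u64 a, const std::vector<u64> &s)
{
    u64 r = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((a >> i) & 1)  r ^= s[i];
    return r;
}

// Reference: square by shift-and-add multiplication modulo c.
inline u64 square_by_mult(u64 a, u64 c, unsigned n)
{
    const u64 top = u64(1) << n;
    u64 p = 0, b = a;
    while (b)
    {
        if (b & 1)  p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & top)  a ^= c;
    }
    return p;
}

// True if both methods agree for all elements of the field.
inline bool tables_agree(u64 c, unsigned n)
{
    assert(n <= 20);  // exhaustive check
    const std::vector<u64> s = squaring_table(c, n);
    for (u64 a = 0; a < (u64(1) << n); ++a)
        if (square_by_table(a, s) != square_by_mult(a, c, n))  return false;
    return true;
}
```

For the example modulus $x^{13} + x^4 + x^3 + x + 1$ (the word 0x201B) the entry `s[7]` is the word for $x^5 + x^4 + x^2 + x$.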


42.1.3 Computation of the trace 


The trace of an element $a$ in GF($2^n$) is defined as

$$ \mathrm{Tr}(a) := a + a^2 + a^4 + a^8 + \ldots + a^{2^{n-1}} = \sum_{j=0}^{n-1} a^{2^j} $$   (42.1-7)

The trace of any element is either 0 or 1. The trace of 0 is always 0. The trace of 1 in GF($2^n$) equals 1
for odd $n$, else 0, that is, $\mathrm{Tr}(1) \equiv n \bmod 2$. Exactly half of the elements have trace 1.
The trace of the sum of two elements is the sum of the traces of the elements:

$$ \mathrm{Tr}(a+b) = \mathrm{Tr}(a) + \mathrm{Tr}(b) $$   (42.1-8)

With $u$ zero or one (an element of the ground field GF(2)) we have

$$ \mathrm{Tr}(u \cdot a) = u \cdot \mathrm{Tr}(a) $$   (42.1-9)

The trace function is linear: for $u_1$ and $u_2$ from the ground field we have

$$ \mathrm{Tr}(u_1 \cdot a + u_2 \cdot b) = u_1 \cdot \mathrm{Tr}(a) + u_2 \cdot \mathrm{Tr}(b) $$   (42.1-10)

A fast algorithm to compute the trace uses the trace vector, a precomputed table $t_i = \mathrm{Tr}(x^i)$ for
$i = 0, 1, 2, \ldots, n-1$, and the linearity of the trace

$$ \mathrm{Tr}(a) = \sum_{i=0}^{n-1} u_i\, t_i $$   (42.1-11)

where the $u_i$ are zero or one (the coefficients of $a$). Precompute the trace vector tv whose bits are the traces of the powers of
$x$ and later compute the trace of an element via [FXT: bpol/gf2n-trace.h]:


    inline ulong gf2n_fast_trace(ulong a, ulong tv)
    // Fast computation of the trace of a
    // using the pre-calculated table tv.
    { return parity( a & tv ); }  // scalar product over GF(2)


Given the trace vector it is also easy to find elements of trace zero or one by simply taking the lowest
unset or set bit of the vector, respectively. There are polynomials such that the trace vector contains just
one nonzero bit, see section 42.3.3 on page 896.
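The relation between the definition of the trace and the trace vector can be checked with a small stand-alone sketch. All names below (`gf2n_mult`, `trace_by_definition`, `trace_vector`, `fast_trace`, and the two checking helpers) are hypothetical and chosen for this example, the parity is taken via `std::bitset`.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Product in GF(2^n) with field polynomial c (shift-and-add).
inline u64 gf2n_mult(u64 a, u64 b, u64 c, unsigned n)
{
    const u64 top = u64(1) << n;
    assert((c >> n) == 1 && a < top && b < top);
    u64 p = 0;
    while (b)
    {
        if (b & 1)  p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & top)  a ^= c;
    }
    return p;
}

// Trace by definition: a + a^2 + a^4 + ... + a^(2^(n-1)), the result is 0 or 1.
inline u64 trace_by_definition(u64 a, u64 c, unsigned n)
{
    u64 t = 0;
    for (unsigned j = 0; j < n; ++j)
    {
        t ^= a;
        a = gf2n_mult(a, a, c, n);
    }
    return t;
}

// Trace vector: bit i is Tr(x^i).
inline u64 trace_vector(u64 c, unsigned n)
{
    assert(n >= 2 && n <= 62);
    u64 tv = 0, xi = 1;  // xi == x^i
    for (unsigned i = 0; i < n; ++i)
    {
        tv |= trace_by_definition(xi, c, n) << i;
        xi = gf2n_mult(xi, 2, c, n);
    }
    return tv;
}

// Trace via the trace vector: scalar product over GF(2).
inline u64 fast_trace(u64 a, u64 tv)
{
    return std::bitset<64>(a & tv).count() & 1;
}

// True if both computations agree for all elements.
inline bool fast_trace_agrees(u64 c, unsigned n)
{
    const u64 tv = trace_vector(c, n);
    for (u64 a = 0; a < (u64(1) << n); ++a)
        if (fast_trace(a, tv) != trace_by_definition(a, c, n))  return false;
    return true;
}

// Number of elements with trace one.
inline u64 count_trace_one(u64 c, unsigned n)
{
    const u64 tv = trace_vector(c, n);
    u64 ct = 0;
    for (u64 a = 0; a < (u64(1) << n); ++a)  ct += fast_trace(a, tv);
    return ct;
}
```

For $C = x^4 + x + 1$ the trace vector is .1..., and 8 of the 16 elements have trace one.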
42.1.4 Inverse and square root

The number of elements in GF($Q$) equals $Q$. For any element $a \in$ GF($Q$) one has

$$ a^Q = a $$   (42.1-12)

and so $a^{Q-1} = 1$ for nonzero $a$. So we can compute the inverse $a^{-1}$ of a nonzero element $a$ as

$$ a^{-1} = a^{Q-2} $$   (42.1-13)


We have seen this technique of inversion by exponentiation in section 39.7.4 for the special
case GF($p$).


All elements except zero are invertible in a field. That is, the number of invertible elements (units) in
GF($Q$) equals $|\mathrm{GF}(Q)^*| = Q - 1 = p^n - 1$.


Every element $a$ of GF($2^n$) has a unique square root $s$ which can be computed as

$$ s = a^{Q/2} = a^{2^{n-1}} $$   (42.1-14)


It can be computed by squaring the element $n-1$ times. But the square root is a linear function, so
we can again apply table lookup methods. A method that uses the precomputed value $\sqrt{x}$ is described
in [145]: for an element $a = \sum_k a_k\, x^k$ we have

$$ \sqrt{a} = \sum_{k\ \mathrm{even}} a_k\, x^{k/2} + \sqrt{x} \sum_{k\ \mathrm{odd}} a_k\, x^{(k-1)/2} $$   (42.1-15)


The only nontrivial operation is the multiplication with $\sqrt{x}$. If the field polynomial is of the form
$C = B^2 x + 1$, then $\sqrt{x} = B\, x$. If the polynomial $B$ has low weight, then the multiplication by $\sqrt{x}$ is
cheap. Moreover, no reduction by $C$ is required when computing the product with $\sqrt{x}$ on the right side
of relation 42.1-15, see [26]. Polynomials of this form also have a trace vector with only a single one, see
section 42.3.3 on page 896.


Also every binary polynomial $C$ can be expressed as $C = A^2 + x\, B^2$, so $\sqrt{x}$ can be computed via modular
division as

$$ \sqrt{x} = \frac{A}{B} \bmod C $$   (42.1-16)


The method given as relation 42.1-15 can be generalized for higher roots: for example, for the third root
we have

$$ \sqrt[3]{a} = S(0) + \sqrt[3]{x}\, S(1) + (\sqrt[3]{x})^2\, S(2) $$   (42.1-17a)
$$ \phantom{\sqrt[3]{a}} = S(0) + \sqrt[3]{x}\, \left[ S(1) + \sqrt[3]{x}\, S(2) \right] $$   (42.1-17b)

where $S(m)$ is defined as

$$ S(m) := \sum_{k \equiv m \bmod 3} a_k\, x^{(k-m)/3} $$   (42.1-18)

42.1.5 Order and primitive roots 


The order of an element $a$ is the least positive exponent $r$ such that $a^r = 1$. The maximal order of an
element in GF($2^n$) equals $2^n - 1 = Q - 1$. An element of maximal order is called a generator (or primitive
root) as its powers 'generate' all nonzero elements in the field. The order of a given element $a$ in GF($2^n$)
can be computed as follows (compare to section 39.7.1.2 on page 779):

    function order(a, n)
    {
        if a==0  then return 0  // a not a unit
        h := 2**n - 1  // number of units
        e := h
        {np, p[], k[]} := factorization(h)  // h==product(i=0..np-1, p[i]**k[i])
        for i:=0 to np-1
        {
            f := p[i]**k[i]
            e := e / f
            g1 := a**e  // modulo polynomial
            while g1!=1
            {
                g1 := g1**p[i]  // modulo polynomial
                e := e * p[i]
            }
        }
        return e
    }


The C++ implementation is given in [FXT: bpol/gf2n-order.cc]:

    ulong gf2n_order(ulong g, ulong c, ulong h, const factorization &mfact)
    // Return order of g in GF(2**n) with field polynomial c.
    // c must be irreducible
    // h must be equal 1<<(deg(c)-1)
    // mfact must contain the factorization of 2**deg(c)-1
    // The routine may loop if either:
    //  - the polynomial c is reducible
    //  - deg(g) >= deg(c)
    //  - h is not set correctly
    //  - mfact is not set correctly
    {
        if ( 0==g )  return 0;  // not in multiplicative group

        ulong m = mfact.product();
        ulong e = m;
        for (int i=0; i<mfact.nprimes(); ++i)
        {
            long p = mfact.prime(i);
            long f = mfact.primepow(i);

            e /= f;

            ulong g1 = bitpolmod_power(g, e, c, h);
            while ( g1!=1 )
            {
                g1 = bitpolmod_power(g1, p, c, h);
                e *= p;
            }
        }

        return e;
    }


Let $a$ be an element of order $r$, then the order of $a^k$ is

$$ \mathrm{ord}(a^k) = \frac{r}{\gcd(k, r)} $$   (42.1-19)

The order remains unchanged if $\gcd(k, r) = 1$. Let $N = 2^n - 1$ and $g$ a generator of GF($2^n$). Then all
$\varphi(N)$ generators are of the form $g^k$ where $\gcd(k, N) = 1$. For $N$ a (Mersenne) prime the order of all
invertible elements except 1 is $N$.
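The order computation and relation 42.1-19 can be tried out with the following stand-alone sketch. It is not the FXT routine: the factorization of $2^n - 1$ is found by simple trial division inside the function, and all names and signatures are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Product in GF(2^n) with field polynomial c (shift-and-add).
inline u64 gf2n_mult(u64 a, u64 b, u64 c, unsigned n)
{
    const u64 top = u64(1) << n;
    assert((c >> n) == 1 && a < top && b < top);
    u64 p = 0;
    while (b)
    {
        if (b & 1)  p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & top)  a ^= c;
    }
    return p;
}

// a**e by binary exponentiation.
inline u64 gf2n_power(u64 a, u64 e, u64 c, unsigned n)
{
    u64 r = 1;
    for ( ; e; e >>= 1)
    {
        if (e & 1)  r = gf2n_mult(r, a, c, n);
        a = gf2n_mult(a, a, c, n);
    }
    return r;
}

// Order of a in GF(2^n), c must be irreducible (else the routine may loop).
inline u64 gf2n_order(u64 a, u64 c, unsigned n)
{
    assert(n >= 1 && n <= 32);
    if (a == 0)  return 0;            // not a unit
    const u64 h = (u64(1) << n) - 1;  // number of units
    u64 e = h;
    u64 m = h;                        // part of h not yet factored
    for (u64 p = 2; m > 1; ++p)
    {
        if (p * p > m)  p = m;        // the remaining part is prime
        if (m % p)  continue;
        u64 f = 1;                    // full power of p dividing h
        while (m % p == 0)  { m /= p;  f *= p; }
        e /= f;
        u64 g1 = gf2n_power(a, e, c, n);
        while (g1 != 1)
        {
            g1 = gf2n_power(g1, p, c, n);
            e *= p;
        }
    }
    return e;
}

inline u64 gcd_u64(u64 a, u64 b)
{
    while (b)  { const u64 t = a % b;  a = b;  b = t; }
    return a;
}

// Check ord(a**k) == r / gcd(k, r) for k = 1, ..., 2**n - 1 where r == ord(a).
inline bool order_formula_holds(u64 a, u64 c, unsigned n)
{
    const u64 r = gf2n_order(a, c, n);
    for (u64 k = 1; k < (u64(1) << n); ++k)
        if (gf2n_order(gf2n_power(a, k, c, n), c, n) != r / gcd_u64(k, r))  return false;
    return true;
}

// Number of generators (elements of maximal order).
inline u64 count_generators(u64 c, unsigned n)
{
    u64 ct = 0;
    for (u64 a = 1; a < (u64(1) << n); ++a)
        ct += (gf2n_order(a, c, n) == (u64(1) << n) - 1);
    return ct;
}
```

With $n = 4$ there are $\varphi(15) = 8$ generators. The element $x$ has order 15 modulo $x^4 + x + 1$ but only order 5 modulo the non-primitive polynomial $x^4 + x^3 + x^2 + x + 1$.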


42.1.6 Implementation

A C++ class for computations in the fields GF($2^n$) with $n$ not greater than BITS_PER_LONG is [FXT:
class GF2n in bpol/gf2n.h]:

    class GF2n
    // Implementation of binary finite fields GF(2**n)
    // with the arithmetic operations.
    {
    public:
        ulong v_;


The static (that is, class global) elements support the computations:

    public:
        static ulong n_;  // the 'n' in GF(2**n)
        static ulong c_;  // polynomial modulus
        static ulong h_;  // auxiliary bit-mask for computations
        static ulong mm_;  // 2**n - 1 == max order (a Mersenne number)
        static ulong g_;  // a generator (element of maximal order)
        static ulong tv_;  // trace vector
        static ulong sqr_tab[BITS_PER_LONG];  // table for fast squaring
        static factorization mfact_;  // factorization of max order
        static char* pc_;  // chars to print zero and one: e.g. "01" or ".1"
    [--snip--]
        static GF2n zero;  // zero (neutral element wrt. addition) in GF(2**n)
        static GF2n one;   // one (neutral element wrt. multiplication) in GF(2**n)
        static GF2n tr1;   // an element with trace == 1


Note all data is public, making many methods 'get_something()' unnecessary. You can modify the data,
but unless you know exactly what you are doing the results of subsequent computations are undefined.
The constructors from other types are 'explicit' to avoid surprises:

    public:
        explicit GF2n()  { ; }
        explicit GF2n(const ulong i) : v_(i & mm_)  { ; }
        GF2n(const GF2n &g) : v_(g.v_)  { ; }
        ~GF2n()  { ; }


One has to call the initializer before doing any computations [FXT: bpol/gf2n.cc]:


    // if INIT_ASSERT is defined, asserts are C asserts,
    // else init() returns false if one of the tests fail:
    #define INIT_ASSERT

    bool  // static
    GF2n::init(ulong n, ulong c/*=0*/, bool normalq/*=0*/, bool trustme/*=0*/)
    // Initialize class GF(2**n) for 0<n<=BITS_PER_LONG.
    // If an irreducible polynomial c is supplied it is used as modulus,
    // else a primitive polynomial of degree n is used.
    // Irreducibility of c is asserted.
    // If normalq is set, then a primitive normal polynomial is used,
    // if in addition c is supplied, then normality of c is asserted.
    // If trustme is set, then the asserts are omitted.
    {
    [--snip--]
        if ( n_ < BITS_PER_LONG )  // test only works for polynomials that fit into words
        {
            if ( ! trustme )
    #ifdef INIT_ASSERT
                jjassert( bitpol_irreducible_q(c_, h_) );
    #else
                if ( ! bitpol_irreducible_q(c_, h_) )  return false;
    #endif
        }

    [--snip--]
28 
    n = 4  GF(2^n)
    c  = 1..11  == x^4 + x + 1  (polynomial modulus)
    mm = .1111  == 15 = 3 * 5  (maximal order)
    h  = .1...  (aux. bit-mask)
    g  = ...1.  (element of maximal order)
    tv = .1...  (traces of x^i)
    tr1= .1...  (element with trace=1)

      k  :  f:=g**k  Tr(f)  ord(f)   f*f   sqrt(f)
    ....  :   ...1     0       1     ...1   ...1
    ...1  :   ..1.     0      15     .1..   .1.1
    ..1.  :   .1..     0      15     ..11   ..1.
    ..11  :   1...     1       5     11..   1.1.
    .1..  :   ..11     0      15     .1.1   .1..
    .1.1  :   .11.     0       3     .111   .111
    .11.  :   11..     1       5     1111   1...
    .111  :   1.11     1      15     1..1   111.
    1...  :   .1.1     0      15     ..1.   ..11
    1..1  :   1.1.     1       5     1...   1111
    1.1.  :   .111     0       3     .11.   .11.
    1.11  :   111.     1      15     1.11   11.1
    11..  :   1111     1       5     1.1.   11..
    11.1  :   11.1     1      15     111.   1..1
    111.  :   1..1     1      15     11.1   1.11

Figure 42.1-A: Powers of the generator g = x in GF($2^4$) with a primitive polynomial modulus.

The class defines all the standard operators like the binary operators '+' and '-' (which are the same
operation in GF($2^n$)), '*' and '/', the comparison operators '==' and '!=', also the computation of inverse,
powering, order, and trace. The algorithms used for the arithmetic operations are described in an earlier
section (page 832). We give the method for the inverse and the arithmetic shortcut operators as examples:

        GF2n inv()  const
        {
            GF2n z;
            z.v_ = bitpolmod_inverse(v_, GF2n::c_);
            return z;
        }
    [--snip--]

        friend inline GF2n & operator += (GF2n &z, const GF2n &f)
        { z.v_ ^= f.v_;  return z; }

        friend inline GF2n & operator -= (GF2n &z, const GF2n &f)
        { z.v_ ^= f.v_;  return z; }

        friend inline GF2n & operator *= (GF2n &z, const GF2n &f)
        { z.v_ = bitpolmod_mult(z.v_, f.v_, GF2n::c_, GF2n::h_);  return z; }

        friend inline GF2n & operator /= (GF2n &z, const GF2n &f)
        { z *= f.inv();  return z; }


    n = 4  GF(2^n)
    c  = 11111  [normal]  [NON-primitive]
       == x^4 + x^3 + x^2 + x + 1  (polynomial modulus)
    mm = .1111  == 15 = 3 * 5  (maximal order)
    h  = .1...  (aux. bit-mask)
    g  = ...11  (element of maximal order)
    tv = .111.  (traces of x^i)
    tr1= ...1.  (element with trace=1)

      k  :  f:=g**k  Tr(f)  ord(f)   f*f   sqrt(f)
    ....  :   ...1     0       1     ...1   ...1
    ...1  :   ..11     1      15     .1.1   1..1
    ..1.  :   .1.1     1      15     111.   ..11
    ..11  :   1111     1       5     1...   .1..
    .1..  :   111.     1      15     1..1   .1.1
    .1.1  :   11.1     0       3     11..   11..
    .11.  :   1...     1       5     ..1.   1111
    .111  :   .111     0      15     1.1.   1.11
    1...  :   1..1     1      15     ..11   111.
    1..1  :   .1..     1       5     1111   ..1.
    1.1.  :   11..     0       3     11.1   11.1
    1.11  :   1.11     0      15     .111   .11.
    11..  :   ..1.     1       5     .1..   1...
    11.1  :   .11.     0      15     1.11   1.1.
    111.  :   1.1.     0      15     .11.   .111

Figure 42.1-B: Powers of the generator x + 1 in GF($2^4$) with a non-primitive polynomial modulus.


A simple demonstration of the class usage is the program [FXT: gf2n/gf2n-demo.cc]. It prints the
successive powers of a primitive root, their squares and square roots. By default computations in GF($2^4$)
are shown, both for a primitive polynomial modulus (figure 42.1-A) and for a non-primitive polynomial
modulus (figure 42.1-B).

42.2 Minimal polynomials


The minimal polynomial of an element $a$ in GF($2^n$) is defined as the polynomial of least degree which
has $a$ as a root. The minimal polynomial can be computed as the product

$$ p_a(x) := \prod_{k=0}^{r-1} \left( x - a^{2^k} \right) $$   (42.2-1)

where $r$ is the smallest positive integer such that $a^{2^r} = a$. The minimal polynomial of any element is
irreducible and its degree is a divisor of $n$.


The zeros of the polynomial are $a, a^2, a^4, a^8, \ldots, a^{2^{r-1}}$. Note that $(a^{2^{r-1}})^2 = a$, that is,
$p_a = p_{a^2} = p_{a^4} = \ldots = p_{a^{2^{r-1}}}$. The elements $a^2, a^4, \ldots$ are called the conjugates of $a$. For the field GF($p^n$) the minimal
polynomial has the form $\prod_{k=0}^{r-1} (x - a^{p^k})$.


It can be seen from the definition that the coefficients of the minimal polynomial lie in GF($2^n$). However,
all of them are either zero or one, they lie in GF(2). The computation has to be carried out using the
arithmetic in GF($2^n$) but the final result is a binary polynomial [FXT: bpol/gf2n-minpoly.cc]:


    ulong gf2n_minpoly2(GF2n a, ulong &bp)
    // Compute the minimal polynomial p(x) of a in GF(2**n).
    // Return the degree of p().
    // The polynomial p() is written to bp
    {
        GF2n p[BITS_PER_LONG+1];
        ulong n = GF2n::n_;
        for (ulong k=0; k<=n; ++k)  p[k] = 0;
        p[0] = 1;

        ulong d;
        GF2n s = a;
        for (d=1; d<=n; ++d)
        {
            for (ulong k=d; 0!=k; --k)  p[k] = p[k-1];
            p[0] = 0;
            for (ulong k=0; k<d; ++k)  p[k] += p[k+1] * s;
            s = s.sqr();
            if ( s==a )  break;
        }

        // Here all coefficients are either zero or one,
        // so we can fill them into a binary word:
        ulong p2 = 0;
        for (ulong j=0; j<=d; ++j)  p2 |= (p[j].v_ << j);
        bp = p2;

        return d;
    }


[Figure 42.2-A: table for GF(2^n) with n = 6, polynomial modulus x^6 + x + 1, maximal order 63 = 3^2 * 7; entries not recoverable.]

Figure 42.2-A: Minimal polynomials of the powers of a generator in GF($2^6$). Polynomials of degree $n = 6$
that are non-primitive are marked with an 'N'. The trace of an element equals the coefficient of $x^{n-1}$ of
its minimal polynomial.


The algorithm needs $O(n^2)$ space. An algorithm requiring only space $O(n)$ is given in [161]. Compute
the element $m_a \in$ GF($2^n$) as

$$ m_a = \prod_{k=0}^{r-1} \left( x - a^{2^k} \right) $$   (42.2-2)

where $x$ is a root of the field polynomial and $r$ is the smallest positive integer such that $a^{2^r} = a$.
If $r = \deg(c)$ then return $m_a + c$, otherwise return $m_a$. The returned value has to be interpreted as a
polynomial. An implementation is

    ulong
    gf2n_minpoly(GF2n a, ulong &bp)
    {
        if ( a.v_ < 2 )  { bp = 2 ^ a.v_;  return 1; }

        const GF2n x(2UL);  // a root of the polynomial GF2n::c_
        GF2n s = a;
        GF2n m(1UL);  // minpoly
        ulong d = 0;  // degree
        do
        {
            m *= (x - s);  // x - a^(2^d)
            ++d;
            s = s.sqr();
        }
        while ( s != a );

        if ( d==GF2n::n_ )  m.v_ ^= GF2n::c_;
        bp = m.v_;
        return d;
    }


Versions of the routines that do not depend on the class GF2n are given in [FXT: bpol/bitpolmod-minpoly.cc].
The program [FXT: gf2n/gf2n-minpoly-demo.cc] prints the minimal polynomials for the
powers of a primitive element $g$, see figure 42.2-A. The polynomials (of maximal degree) that are
non-primitive are marked by an 'N'. The minimal polynomials for $g^k$ are non-primitive or of degree $< n$
whenever $\gcd(k, 2^n - 1) \neq 1$.


Let $C$ be an irreducible polynomial of degree $n$ and $a$ an element such that $r = \mathrm{ord}_C(a)$ is the order of $a$
modulo $C$. Then the order of $x$ modulo the minimal polynomial of $a$ is also $r$. Thus a primitive polynomial
can be determined from an irreducible polynomial $C$ and a generator $g$ modulo $C$ by computing the
minimal polynomial of $g$.


With a primitive polynomial (and the generator $g = x$) the minimal polynomial of the element $x^k$ is
primitive if $k$ is a Lyndon word and $\gcd(k, 2^n - 1) = 1$. With a fast algorithm for the generation of Lyndon
words we can therefore generate all primitive polynomials as shown in section 40.10 on page 856.
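The space-saving method of relation 42.2-2 can be sketched without the class GF2n. This is an illustration only, not the code from bpol/bitpolmod-minpoly.cc: the signature of `gf2n_minpoly` below and the wrappers `minpoly_word` and `minpoly_degree` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Product in GF(2^n) with field polynomial c (shift-and-add).
inline u64 gf2n_mult(u64 a, u64 b, u64 c, unsigned n)
{
    const u64 top = u64(1) << n;
    assert((c >> n) == 1 && a < top && b < top);
    u64 p = 0;
    while (b)
    {
        if (b & 1)  p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & top)  a ^= c;
    }
    return p;
}

// Write the minimal polynomial of a to bp and return its degree.
// c must be irreducible with deg(c) == n.
inline unsigned gf2n_minpoly(u64 a, u64 c, unsigned n, u64 &bp)
{
    if (a < 2)  { bp = 2 ^ a;  return 1; }  // minimal polynomial is x or x+1
    const u64 x = 2;                        // a root of c
    u64 s = a, m = 1;
    unsigned d = 0;
    do
    {
        m = gf2n_mult(m, x ^ s, c, n);      // factor x - a^(2^d)
        ++d;
        s = gf2n_mult(s, s, c, n);
    }
    while (s != a);
    if (d == n)  m ^= c;                    // degree n: add the field polynomial
    bp = m;
    return d;
}

// Convenience wrappers returning only the polynomial or only the degree.
inline u64 minpoly_word(u64 a, u64 c, unsigned n)
{
    u64 bp = 0;
    gf2n_minpoly(a, c, n, bp);
    return bp;
}

inline unsigned minpoly_degree(u64 a, u64 c, unsigned n)
{
    u64 bp = 0;
    return gf2n_minpoly(a, c, n, bp);
}
```

With the field polynomial $x^4 + x + 1$ and $g = x$, the element $g^5$ has the minimal polynomial $x^2 + x + 1$ of degree 2, and $g^3$ (of order 5) has the non-primitive minimal polynomial $x^4 + x^3 + x^2 + x + 1$.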
42.3 Fast computation of the trace vector


We give two methods for the computation of the trace vector and some properties of the trace vector for
certain field polynomials.


42.3.1 Computation via Newton's formula


Let $C(x)$ be a polynomial with $n$ roots $a_0, a_1, \ldots, a_{n-1}$:

$$ C(x) = \prod_{k=0}^{n-1} (x - a_k) = \sum_{k=0}^{n} c_k\, x^k $$   (42.3-1)

Following [335, sec.32] we define

$$ s_k := a_0^k + a_1^k + \ldots + a_{n-1}^k $$   (42.3-2)

Then, for $m = 1, \ldots, n$, we have Newton's formula:

$$ m\, c_{n-m} = -\sum_{j=0}^{m-1} c_{n-j}\, s_{m-j} $$   (42.3-3)

Now let $C = c_0 + c_1 x + c_2 x^2 + \ldots + c_n x^n$ be an irreducible polynomial with coefficients in GF($p$). Its
roots are $a$, (and the conjugates) $a^p, a^{p^2}, a^{p^3}, \ldots, a^{p^{n-1}}$. Let $t_0 = n$ and $t_i = \mathrm{Tr}(a^i)$ (computationally $x$
is a root of $C$, so $t_i = \mathrm{Tr}(x^i)$). Note that $t_0, \ldots, t_{n-1}$ are the elements of the trace vector, see relation
42.1-11 on page 888. Using $c_n = 1$ (monic polynomial $C$) and $t_i = s_i$ we rewrite Newton's formula as

$$ t_1 = -1\, c_{n-1} $$   (42.3-4a)
$$ t_2 = -c_{n-1} t_1 - 2\, c_{n-2} $$   (42.3-4b)
$$ t_3 = -c_{n-1} t_2 - c_{n-2} t_1 - 3\, c_{n-3} $$   (42.3-4c)
$$ t_4 = -c_{n-1} t_3 - c_{n-2} t_2 - c_{n-3} t_1 - 4\, c_{n-4} $$   (42.3-4d)
$$ t_5 = -c_{n-1} t_4 - c_{n-2} t_3 - c_{n-3} t_2 - c_{n-4} t_1 - 5\, c_{n-5} $$   (42.3-4e)
$$ t_k = -c_{n-1} t_{k-1} - c_{n-2} t_{k-2} - \ldots - c_{n-k+1} t_1 - k\, c_{n-k} $$   (42.3-4f)


To compute the trace vector for the field GF($p^n$), make the assignments in the given order, and finally
compute $t_0 = n \bmod p$. The computation does not involve any polynomial modular reduction so the
method can be worthwhile even for the determination of the trace of just one element.


With binary finite fields, the components with even subscripts can be computed as $t_{2k} = t_k$. During the
computation we set $t_0 = 1$ and correct the value at the end of the routine. An implementation of the
implied algorithm is [FXT: bpol/gf2n-trace.cc]:

    ulong gf2n_trace_vector_x(ulong c, ulong n)
    // Return vector of traces of powers of x, where
    // x is a root of the irreducible polynomial C.
    // Must have:  n == degree(C)
    {
        c &= ~( 2UL<<(n-1) );  // remove coefficient c[n]

        ulong t = 1;  // set t[0]=1, will be corrected at the end
        for (ulong k=1; k<n; ++k)
        {
            if ( k&1 )  // k odd: use recursion
            {
                ulong cv = c >> (n-k);  // polynomial coefficients [n-1]..[n-k]
                cv &= t;                // vector (j=1, k, c[n-j]*t[k-j])
                cv = parity(cv);        // sum (j=1, k, c[n-j]*t[k-j])
                t |= (cv<<k);
            }
            else  // k even: copy t[k/2] to t[k]
            {
                t |= ( (t>>(k/2)) & 1 ) << k;
            }
        }

        // correct t[0]:
        t ^= ((n+1)&1);  // change low bit if n is even

        return t;
    }

The routine involves $n$ computations of the parity. The equivalent routine for large $n$ is $O(n^2)$ ($n$
computations of sums with $O(n)$ summands).
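For a general prime $p$ the assignments 42.3-4 can be written out directly. The following is a minimal sketch with a hypothetical function name and a coefficient-vector interface chosen for this example: `c[k]` is the coefficient of $x^k$ (assumed to lie in the range $0, \ldots, p-1$) and the polynomial is assumed to be monic.

```cpp
#include <cassert>
#include <vector>

// Trace vector t[0..n-1] for the monic polynomial c[0] + c[1] x + ... + c[n] x^n
// with coefficients in GF(p), computed via Newton's formula.
inline std::vector<long> trace_vector_newton(const std::vector<long> &c, long p)
{
    assert(p >= 2 && c.size() >= 2 && c.back() == 1);
    const long n = static_cast<long>(c.size()) - 1;
    std::vector<long> t(n, 0);
    for (long k = 1; k < n; ++k)
    {
        long sum = (k % p) * c[n - k];                        // k * c[n-k]
        for (long j = 1; j < k; ++j)  sum += c[n - j] * t[k - j];
        t[k] = ((-sum) % p + p) % p;                          // t[k] = -sum mod p
    }
    t[0] = n % p;  // set last, t[0] is not used in the loop above
    return t;
}
```

For $p = 2$ and $C = x^4 + x + 1$ this gives the trace vector with only $t_3$ nonzero, in agreement with the value tv = .1... shown for this modulus. For $p = 3$ and $C = x^2 + 1$ the result is $t_0 = 2$, $t_1 = 0$.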
42.3.2 Computation via division of power series


The following variant of the algorithm, suggested by Richard Brent [priv. comm.], shows that the
computation is equivalent to a division of power series. Let $R$ be the reciprocal polynomial of $C$, then (see [76,
p.135])

$$ \log\left( R(x) \right) = -\sum_{j=1}^{\infty} t_j\, x^j / j $$   (42.3-5)

Differentiating both sides gives

$$ \frac{R'(x)}{R(x)} = -\sum_{j=1}^{\infty} t_j\, x^{j-1} $$   (42.3-6)

When using Newton's method for the inversion we have a computational cost of $\gamma\, M(n)$ where $M(n)$
is the cost for the multiplication of two power series up to order $x^n$ and $\gamma$ is a constant (the method
is also given in [153, p.24]). The constant $\gamma$ equals 3 if the division is done by one inversion, which is
two multiplications with the second order Newton iteration, and one final multiplication with $R'(x)$, see
section 29.1.1 on page 567. For large $n$ the multiplications should be done by either one of the splitting
schemes suggested in [59] or by FFT methods such as given in [303].

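The equivalence with a power series division can be tried out over GF(2) with plain bit operations. The sketch below is an illustration only: it uses schoolbook division (quadratic cost) instead of Newton's method, and the names reciprocal_poly, series_div and trace_vector_series are assumptions of this example, not FXT functions. Modulo 2 the sign in the relation between R'(x)/R(x) and the t_j is immaterial, and the derivative keeps exactly the coefficients at odd powers.

```cpp
#include <cassert>
typedef unsigned long long u64;

// Reverse the lowest n+1 bits of c: coefficients of the reciprocal polynomial.
static u64 reciprocal_poly(u64 c, unsigned n)
{
    u64 r = 0;
    for (unsigned k=0; k<=n; ++k)
        if ( (c >> k) & 1 )  r |= 1ULL << (n-k);
    return r;
}

// Quotient num/den as power series over GF(2), terms x^0 ... x^(len-1).
// The denominator must have constant term 1.
static u64 series_div(u64 num, u64 den, unsigned len)
{
    u64 q = 0;
    for (unsigned i=0; i<len; ++i)
    {
        if ( (num >> i) & 1 )
        {
            q |= 1ULL << i;
            num ^= den << i;  // cancel the term x^i
        }
    }
    return q;
}

// Bit k of the result is Tr(x^k), for 0 <= k < n (intended for 2 <= n <= 31).
static u64 trace_vector_series(u64 c, unsigned n)
{
    const u64 R  = reciprocal_poly(c, n);
    const u64 dR = (R >> 1) & 0x5555555555555555ULL;  // derivative modulo 2
    u64 t = series_div(dR, R, n-1) << 1;  // t[j] = coefficient of x^(j-1), j>=1
    t |= (n & 1);  // t[0] = n mod 2
    return t;
}
```

For C = x^5 + x^2 + 1 we get R = x^5 + x^3 + 1 and R' = x^4 + x^2, and the quotient starts with x^2, so t_3 = 1 is the only nonzero component among t_1, ..., t_4.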

42.3.3 Some properties of the trace vector 


For a binary polynomial C of odd degree n and all nonzero coefficients c_i at odd indices i we obtain
t_0 = 1 and t_i = 0 for all i ≠ 0, so the trace of any element is just the value of its lowest bit. In [57] it
is shown that for n ≡ ±3 mod 8 the first nonzero coefficient c_k (with k < n) must appear at a position
k > n/3.

With even degree and all nonzero odd coefficients c_i, c_j, c_k, ... at positions i, j, k, ... < n/2 the only
nonzero components of the trace vector are t_{n-i}, t_{n-j}, t_{n-k}, .... Thus polynomials of even degree
with just one nonzero coefficient c_k where k < n/2 lead to only t_{n-k} being nonzero. A special case are
trinomials C = x^n + x^k + 1 of even degree n and k < n/2 (k must be odd else C is reducible). In the trace
vector for all-ones polynomials (C = \sum_{k=0}^{n} x^k, see section 40.9.9 on page 852) the only zero component
is t_0. A detailed discussion of the properties of the trace vector is given in [5].


42.4 Solving quadratic equations 


We want to solve, in GF(2^n), the equation

  a x^2 + b x + c = 0    (42.4-1)

Extracting a square root of an arbitrary element in GF(2^n) is easy, but this does not enable us to solve
the given equation. The formula r_{0,1} = (-b ± \sqrt{b^2 - 4 a c}) / (2 a) that works fine for real and complex
numbers is of no help here: how should we divide by 2?
  n = 5  GF(2^n)
  c    = 1..1.1  == x^5 + x^2 + 1  (polynomial modulus)
  mm   = .11111  == 31 (prime)  (maximal order)
  h    = .1....  (aux. bit-mask)
  g    = ....1.  (element of maximal order)
  tv   = ..1..1  (traces of x^i)
  tr1e = .....1  (element with trace=1)
    k:  f:=g**k  Tr(f)  RootOf(z^2+z=f)
    0:   ....1     1
    1:   ...1.     0     .1..1
    2:   ..1..     0     .1.11
    3:   .1...     1
    4:   1....     0     .1111
    5:   ..1.1     1
    6:   .1.1.     1
    7:   1.1..     0     ..1..
    8:   .11.1     0     11111
    9:   11.1.     1
   10:   1...1     1
   11:   ..111     1
   12:   .111.     1
   13:   111..     1
   14:   111.1     0     1....
   15:   11111     0     11..1
   16:   11.11     0     1..1.
   17:   1..11     1
   18:   ...11     1
   19:   ..11.     0     ...1.
   20:   .11..     1
   21:   11...     1
   22:   1.1.1     1
   23:   .1111     0     1.11.
   24:   1111.     1
   25:   11..1     0     11.11
   26:   1.111     1
   27:   .1.11     0     111.1
   28:   1.11.     0     .11.1
   29:   .1..1     0     1.1..
   30:   1..1.     0     ..11.

Figure 42.4-A: Solutions of the equation z^2 + z = f for all elements f ∈ GF(2^5) with trace zero.


Instead we transform the equation into a special form: divide by a: x^2 + (b/a) x + (c/a) = 0, substitute
x = z (b/a) to get z^2 (b/a)^2 + (b/a)^2 z + (c/a) = 0, and divide by (b/a)^2 to obtain

  z^2 + z + C = 0  where  C = a c / b^2    (42.4-2)

If r_0 is one solution of this equation, then r_1 = r_0 + 1 is the other one: z (z + 1) = C. The equation does
not necessarily have a solution at all: the trace of C must be zero because we have Tr(C) = Tr(z^2 + z) =
Tr(z^2) + Tr(z) = Tr(z) + Tr(z) = 0 for all z ∈ GF(2^n).


The following function checks whether the reduced equation has a solution and if so, returns true and
writes one solution to the variable r [FXT: bpol/gf2n-solvequadratic.cc]:

    bool gf2n_solve_reduced_quadratic(GF2n c, GF2n& r)
    // Solve z^2 + z == c
    {
        if ( 1==c.trace() )  return false;
        GF2n t( GF2n::tr1e );
        GF2n z( GF2n::zero );
        GF2n u( t );
        for (ulong j=1; j<GF2n::n_; ++j)
        {
            GF2n u2 = u.sqr();
            z = z.sqr();  z += u2*c;  // z = z*z + c*u*u
            u = u2 + t;               // u = u*u + t
        }
        r = z;
        return true;
    }

  ------- k=1: -------
  u = t^2 + t
  z = c*t^2
  z^2 = c^2*t^4
  z^2+z = c^2*t^4 + c*t^2
  ------- k=2: -------
  u = t^4 + t^2 + t
  z = (c^2 + c)*t^4 + c*t^2
  z^2 = (c^4 + c^2)*t^8 + c^2*t^4
  z^2+z = (c^4 + c^2)*t^8 + c*t^4 + c*t^2
  ------- k=3: -------
  u = t^8 + t^4 + t^2 + t
  z = (c^4 + c^2 + c)*t^8 + (c^2 + c)*t^4 + c*t^2
  z^2 = (c^8 + c^4 + c^2)*t^16 + (c^4 + c^2)*t^8 + c^2*t^4
  z^2+z = (c^8 + c^4 + c^2)*t^16 + c*t^8 + c*t^4 + c*t^2
        = (c^8 + c^4 + c^2)*t + c*(t^8 + t^4 + t^2)   [using t^16=t]
        = ( c + trace(c) )*t + c*( t + trace(t) )     [using x^8+x^4+x^2=x+trace(x)]
        = ( c + 0 )*t + c*(t+1)                       [using trace(c)=0, trace(t)=1]
        = c                                           [ z^2+z==c ]

Figure 42.4-B: Solving the reduced quadratic equation z^2 + z = c in GF(2^4).


Figure 42.4-A shows the solutions to the reduced equations z^2 + z = f for all elements f with trace zero
[FXT: gf2n/gf2n-solvequadratic-demo.cc].


The implementation of the algorithm takes advantage of a precomputed element with trace one. At the
end of step k ≥ 1 we have

  u_k = \sum_{j=0}^{k} t^{2^j}    (42.4-3a)

  z_k = \sum_{j=0}^{k-1} \sum_{i=0}^{k-1-j} c^{2^i} t^{2^{k-j}}    (42.4-3b)

Figure 42.4-B shows (for GF(2^4)) that this expression is the solution sought.

For GF(2^n) with n odd the solution of the reduced quadratic equation z^2 + z = A can be computed via
the half-trace of A which is defined as

  H(A) = A + A^4 + A^{16} + ... + A^{2^{n-1}}    (42.4-4)

We have H(A)^2 + H(A) = Tr(A) + A, so H(A) is a solution of the reduced quadratic if Tr(A) = 0.
The half-trace of an element A in the field with field polynomial C can be computed as follows [FXT:
bpol/gf2n-trace.cc]:

    ulong
    gf2n_half_trace(ulong a, ulong c, ulong h)
    {
        ulong t = a;
        ulong d = h;
        while ( d>>=2 )
        {
            t = bitpolmod_square(t, c, h);
            t = bitpolmod_square(t, c, h);
            t ^= a;
        }
        return t;
    }



The following routine computes both solutions. It transforms the equation into the reduced form, solves
it, and finally transforms back [FXT: bpol/gf2n-solvequadratic.cc]:

    bool gf2n_solve_quadratic(GF2n a, GF2n b, GF2n c, GF2n& r0, GF2n& r1)
    // Solve a*x^2 + b*x + c == 0
    // Return whether solutions exist.
    // If so, the solutions are written to r0 and r1.
    {
        GF2n cc = a*c;
        cc /= (b.sqr());  // cc = (a*c)/(b*b)
        GF2n r;
        bool q = gf2n_solve_reduced_quadratic(cc, r);
        if ( !q )  return false;
        GF2n s = b / a;
        r0 = r * s;
        r1 = (r+GF2n::one) * s;
        return true;
    }

Routines for the solution of quadratic equations that do not depend on the class GF2n are given in [FXT:
bpol/bitpolmod-solvequadratic.cc].

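The whole procedure (reduction, solution of z^2 + z = C, back substitution) can be exercised on a small field without the class GF2n. The following is a sketch under stated assumptions: the struct Field and all function names are hypothetical, field elements are plain bit masks, the trace-one element is found by a linear search rather than being precomputed, and both a and b are assumed to be nonzero.

```cpp
#include <cassert>
typedef unsigned long long u64;

struct Field { u64 c; unsigned n; };  // modulus c of degree n

static u64 mul(const Field &F, u64 a, u64 b)
{
    u64 p = 0;
    while ( b )
    {
        if ( b & 1 )  p ^= a;
        b >>= 1;
        a <<= 1;
        if ( (a >> F.n) & 1 )  a ^= F.c;
    }
    return p;
}

static unsigned trace(const Field &F, u64 a)
{
    u64 t = 0;
    for (unsigned j=0; j<F.n; ++j)  { t ^= a;  a = mul(F, a, a); }
    return (unsigned)t;
}

static u64 inv(const Field &F, u64 a)  // a^(2^n-2), for a != 0
{
    u64 r = 1, s = mul(F, a, a);
    for (unsigned j=1; j<F.n; ++j)  { r = mul(F, r, s);  s = mul(F, s, s); }
    return r;
}

// Solve z^2 + z = C, return false if Tr(C) == 1.
static bool solve_reduced(const Field &F, u64 C, u64 &z)
{
    if ( trace(F, C) )  return false;
    u64 t = 1;
    while ( 0 == trace(F, t) )  ++t;  // search an element with trace one
    u64 u = t;
    z = 0;
    for (unsigned j=1; j<F.n; ++j)
    {
        const u64 u2 = mul(F, u, u);
        z = mul(F, z, z) ^ mul(F, u2, C);  // z = z*z + C*u*u
        u = u2 ^ t;                        // u = u*u + t
    }
    return true;
}

// Solve a*x^2 + b*x + c = 0, must have a != 0 and b != 0.
static bool solve_quadratic(const Field &F, u64 a, u64 b, u64 c, u64 &r0, u64 &r1)
{
    const u64 C = mul(F, mul(F, a, c), inv(F, mul(F, b, b)));  // (a*c)/(b*b)
    u64 z;
    if ( !solve_reduced(F, C, z) )  return false;
    const u64 s = mul(F, b, inv(F, a));  // b/a
    r0 = mul(F, z, s);
    r1 = mul(F, z ^ 1, s);
    return true;
}

// Return the number of solvable equations, or 0 if any root is wrong.
static unsigned check_all(const Field &F)
{
    const u64 N = 1ULL << F.n;
    unsigned ct = 0;
    for (u64 a=1; a<N; ++a)  for (u64 b=1; b<N; ++b)  for (u64 c=0; c<N; ++c)
    {
        u64 r0, r1;
        if ( !solve_quadratic(F, a, b, c, r0, r1) )  continue;
        if ( mul(F, a, mul(F, r0, r0)) ^ mul(F, b, r0) ^ c )  return 0;
        if ( mul(F, a, mul(F, r1, r1)) ^ mul(F, b, r1) ^ c )  return 0;
        if ( r0 == r1 )  return 0;
        ++ct;
    }
    return ct;
}
```

With the modulus x^4 + x + 1, exactly half of the constants c give a solvable equation for each pair (a, b), so check_all should report 15 * 15 * 8 = 1800 solvable equations.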

42.5 Representation by matrices ‡


  n = 4  GF(2^n)
  c = 1..11 == x^4 + x + 1  (polynomial modulus)
    k:  f:=g**k
    0:   ...1
    1:   ..1.
    2:   .1..      1...1..11.1.111
    3:   1...      .1..11.1.1111..
    4:   ..11      ..1..11.1.1111.
    5:   .11.      ...1..11.1.1111
    6:   11..
    7:   1.11
    8:   .1.1      M_0 =   M_1 =   M_2 =   M_3 =   M_4 =   M_5 =
    9:   1.1.      1...    ...1    ..1.    .1..    1..1    ..11
   10:   .111      .1..    1..1    ..11    .11.    11.1    1.1.
   11:   111.      ..1.    .1..    1..1    ..11    .11.    11.1
   12:   1111      ...1    ..1.    .1..    1..1    ..11    .11.
   13:   11.1
   14:   1..1

Figure 42.5-A: Powers of the primitive element in GF(2^4) with field polynomial x^4 + x + 1 (left), the
list of powers rotated counter clockwise by 90 degrees (top right), and the matrices obtained by taking 4
consecutive columns of the list (bottom right).


With a primitive polynomial modulus C(x) a representation of the elements of GF(2^n) as matrices can be
obtained from the powers of the generator ('x') as follows: take rows k through k + n - 1 as the columns of
matrices M_k, as shown in figure 42.5-A. Now we have M_k = M_1^k, so we can use the matrices to represent
the elements of GF(2^n).

The matrix M := M_1 is the companion matrix of the polynomial modulus C. The companion matrix of
a polynomial p(x) = x^n - \sum_{i=0}^{n-1} c_i x^i of degree n is defined as the n x n matrix

  [ 0  0  0  ...  0  c_0     ]
  [ 1  0  0  ...  0  c_1     ]
  [ 0  1  0  ...  0  c_2     ]
  [ :  :  :       :  :       ]        (42.5-1)
  [ 0  0  0  ...  0  c_{n-2} ]
  [ 0  0  0  ...  1  c_{n-1} ]

For polynomials p(x) = \sum_{i=0}^{n} a_i x^i set c_i := -a_i/a_n.

The characteristic polynomial c(x) of an n x n matrix M is defined as

  c(x) := det(x E_n - M)    (42.5-2)

where E_n is the n x n unit matrix. The roots of the characteristic polynomial are the eigenvalues of the
matrix. The characteristic polynomial of the companion matrix of a polynomial p(x) equals p(x). If p(x)
is the characteristic polynomial of a matrix M, then p(M) = 0 (non-proof: set x = M in relation 42.5-2;
for a proof see [101]).
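The relation between the matrices and the powers of the generator can be checked mechanically for the field of figure 42.5-A. The sketch below is an illustration with made-up names (Mat4, xpow, M, mul_vec, mul, equal, check_powers); it stores each matrix as four column words, where bit i of a column is the entry in row i, and hard-codes the modulus x^4 + x + 1.

```cpp
#include <cassert>

struct Mat4 { unsigned col[4]; };  // col[j] = j-th column as a 4-bit word

// x^k modulo C = x^4 + x + 1
static unsigned xpow(unsigned k)
{
    unsigned p = 1;
    for (unsigned i=0; i<k; ++i)  { p <<= 1;  if ( p & 16 )  p ^= 0x13; }
    return p;
}

// M_k: the columns are x^k, x^(k+1), x^(k+2), x^(k+3)
static Mat4 M(unsigned k)
{
    Mat4 m;
    for (unsigned j=0; j<4; ++j)  m.col[j] = xpow(k+j);
    return m;
}

// Matrix times column vector over GF(2).
static unsigned mul_vec(const Mat4 &m, unsigned v)
{
    unsigned r = 0;
    for (unsigned j=0; j<4; ++j)  if ( (v >> j) & 1 )  r ^= m.col[j];
    return r;
}

// Matrix product over GF(2).
static Mat4 mul(const Mat4 &a, const Mat4 &b)
{
    Mat4 r;
    for (unsigned j=0; j<4; ++j)  r.col[j] = mul_vec(a, b.col[j]);
    return r;
}

static bool equal(const Mat4 &a, const Mat4 &b)
{
    for (unsigned j=0; j<4; ++j)  if ( a.col[j] != b.col[j] )  return false;
    return true;
}

// Check M_j * M_k == M_(j+k) for all exponents modulo 15.
static bool check_powers()
{
    for (unsigned j=0; j<15; ++j)
        for (unsigned k=0; k<15; ++k)
            if ( !equal( mul(M(j), M(k)), M((j+k) % 15) ) )  return false;
    return true;
}
```

The last column of M(1) is the word 0011, that is, the coefficients c_0 = c_1 = 1 of the modulus, as expected for a companion matrix.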


    k:  [ p_k(x) ]^d        k:  [ p_k(x) ]^d
    0:  [ 11 ]^4            7:  [ 1..11 ]^1
    1:  [ 11..1 ]^1         8:  [ 11..1 ]^1
    2:  [ 11..1 ]^1         9:  [ 11111 ]^1
    3:  [ 11111 ]^1        10:  [ 111 ]^2
    4:  [ 11..1 ]^1        11:  [ 1..11 ]^1
    5:  [ 111 ]^2          12:  [ 11111 ]^1
    6:  [ 11111 ]^1        13:  [ 1..11 ]^1
                           14:  [ 1..11 ]^1

Figure 42.5-B: Characteristic polynomials of the powers of the generator x with the field GF(2^4) and
the polynomial x^4 + x + 1.


Let c_k(x) be the characteristic polynomial of the matrix M_k = M^k and p_k(x) the minimal polynomial of
the element g^k ∈ GF(2^n). Then

  c_k(x) = [p_k(x)]^d  where  d = n/r    (42.5-3)

where r is the smallest positive integer such that M_k^{2^r} = M_k. For example, for the primitive modulus
C(x) = x^4 + x + 1 the sequence of characteristic polynomials of the powers of the generator 'x' is shown
in figure 42.5-B.

The trace of the matrix M^k is d times the polynomial trace of the minimal polynomial of g^k.
The polynomial trace of p(x) = x^n - (c_{n-1} x^{n-1} + ... + c_1 x + c_0) equals c_{n-1}, as can be seen from
relation 42.5-1.

By construction, picking the first column of M_k gives the vector of the coefficients of the polynomial x^k
modulo C(x):

  M^k [1, 0, 0, ..., 0]^T ≡ x^k mod C(x)    (42.5-4)

Finally, the characteristic polynomial of an element a ∈ GF(2^n) in polynomial representation can be
written as

  p_a(x) := \prod_{k=0}^{n-1} (x - a^{2^k})    (42.5-5)

Compare to relation 42.2-1 for minimal polynomials.


42.6 Representation by normal bases 


So far we used the n basis vectors x^0, x^1, x^2, x^3, ..., x^{n-1} to represent an element a ∈ GF(2^n) (as a vector
space over GF(2)):

  a = \sum_{k=0}^{n-1} a_k x^k    (42.6-1)

The arithmetic operations were the polynomial operations modulo an irreducible polynomial modulus C.

For certain irreducible polynomials (which are called normal polynomials or N-polynomials) it is possible
to use the normal basis x, x^2, x^4, x^8, ..., x^{2^{n-1}} to represent elements of GF(2^n):

  a = \sum_{k=0}^{n-1} a_k x^{2^k}    (42.6-2)

To check whether a polynomial C is normal, compute r_k = x^{2^k} mod C for 1 ≤ k ≤ n, and compute the
nullspace of the matrix M whose k-th row is r_k. If the nullspace is empty (that is, M · v = 0 implies
v = 0), then the polynomial is normal.


The normality of a polynomial is equivalent to its roots being linearly independent. See section 40.5.1
on page 841 for the equivalence of computations modulo a polynomial and computations with linear
combinations of its roots.

An element f ∈ GF(2^n) where f^1, f^2, f^4, f^8, ..., f^{2^{n-1}} are linearly independent is called a normal element
(or free element). The minimal polynomial of a normal element f is normal.

Addition and subtraction with a normal basis is again a simple XOR. Squaring is a cyclic shift by one
position. Taking the square root is a cyclic shift in the other direction.

In normal basis representation the element one is the all-ones word. So adding one is equivalent to
complementing the binary word.

The trace can be computed easily with normal bases: it equals the parity of the binary word.

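These rules can be verified against ordinary polynomial arithmetic for a small normal polynomial. The following sketch assumes the normal polynomial x^5 + x^4 + x^3 + x^2 + 1 and the convention that bit k of a word is the coefficient of x^(2^k); the function names are hypothetical and not taken from FXT.

```cpp
#include <cassert>
typedef unsigned long long u64;

static const u64 C = 0x3D;     // x^5 + x^4 + x^3 + x^2 + 1, a normal polynomial
static const unsigned N = 5;

static u64 gf_mul(u64 a, u64 b)  // product in polynomial representation
{
    u64 p = 0;
    while ( b )
    {
        if ( b & 1 )  p ^= a;
        b >>= 1;
        a <<= 1;
        if ( (a >> N) & 1 )  a ^= C;
    }
    return p;
}

static unsigned gf_trace(u64 a)
{
    u64 t = 0;
    for (unsigned j=0; j<N; ++j)  { t ^= a;  a = gf_mul(a, a); }
    return (unsigned)t;
}

static unsigned parity(u64 v)
{
    unsigned p = 0;
    while ( v )  { p ^= 1;  v &= v - 1; }
    return p;
}

// Convert normal basis representation to polynomial representation.
static u64 to_poly(u64 rep)
{
    u64 e = 2, p = 0;  // e runs through x^(2^j)
    for (unsigned j=0; j<N; ++j)
    {
        if ( (rep >> j) & 1 )  p ^= e;
        e = gf_mul(e, e);
    }
    return p;
}

static u64 rotl(u64 a)  // cyclic shift by one position (N-bit words)
{
    return ( (a << 1) | (a >> (N-1)) ) & ((1ULL << N) - 1);
}

static bool check_normal_basis_rules()
{
    const u64 ones = (1ULL << N) - 1;
    if ( to_poly(ones) != 1 )  return false;  // one is the all-ones word
    for (u64 r=0; r<=ones; ++r)
    {
        const u64 p = to_poly(r);
        if ( to_poly(rotl(r)) != gf_mul(p, p) )  return false;  // squaring
        if ( to_poly(r ^ ones) != (p ^ 1) )  return false;      // adding one
        if ( parity(r) != gf_trace(p) )  return false;          // trace
    }
    return true;
}
```

The direction of the shift depends on the bit order chosen for the words; with the convention used here squaring is a rotation towards the higher bits.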

42.6.1 Multiplication and test for normality 


For the multiplication of two elements we use the multiplication matrix M. Given two elements a, b ∈
GF(2^n) in normal basis representation

  a = \sum_{k=0}^{n-1} a_k x^{2^k},    b = \sum_{k=0}^{n-1} b_k x^{2^k}    (42.6-3)

their product p = a · b can be computed as follows: for the first component p_0 of the product we have

  p_0 = a^T · M · b    (42.6-4)

and in general

  p_k = (a^{2^{-k}})^T · M · b^{2^{-k}}    (42.6-5)

That is, all components of the product are computed like the first, but with a and b cyclically shifted.


[Listing for the normal polynomials c=11111 (left) and c=111.11 (right): the matrices A, A^-1, C^T,
D=A*C^T*A^(-1), and the multiplication matrix M.]

Figure 42.6-A: Matrices that occur with the computation of the multiplication matrix for the field
polynomials c = 1 + x + x^2 + x^3 + x^4 (left) and c = 1 + x + x^2 + x^4 + x^5 (right).


An algorithm to check whether a given polynomial c is normal and, if so, compute the multiplication
matrix M can be given as follows:

1. If the polynomial c is reducible, return false.

2. Compute the matrix A whose k-th row equals x^{2^k} mod c. If A is not invertible, then (the nullspace
is not empty and) c is not normal, so return false.

3. Set D := A · C^T · A^{-1} where C is the companion matrix of c.

4. Compute the multiplication matrix M where M_{i,j} := D_{i',j'}, i' := -i mod n and j' := j - i mod n.
Return (true and) the matrix M.

An implementation is given in [FXT: bpol/bitpol-normal.cc]. Examples of the intermediate results for
two different field polynomials are given in figure 42.6-A.


A C++ function implementing the multiplication algorithm is [FXT: bpol/normal-mult.cc]:

    ulong
    normal_mult(ulong a, ulong b, const ulong *M, ulong n)
    // Multiply two elements (a and b in GF(2^n)) in normal basis representation.
    // The multiplication matrix has to be supplied in M.
    {
        ulong p = 0;
        for (ulong k=0; k<n; ++k)
        {
            ulong v = bitmat_mult_Mv(M, n, b);  // M*b
            v = parity( v & a );  // a*M*b (dot product)
            p ^= (v<<k);
            a = bit_rotate_right(a, 1, n);
            b = bit_rotate_right(b, 1, n);
        }
        return p;
    }



The routine for the multiplication M · v^T of a binary vector by a matrix is given in [FXT: bmat/bitmat-inline.h]:

    inline ulong bitmat_mult_Mv(const ulong *M, ulong n, ulong v)
    {
        ulong p = 0;
        for (ulong j=0; j<n; ++j)
        {
            ulong t = parity( M[j] & v );
            p |= (t<<j);
        }
        return p;
    }



A multiplication v · M is more efficient:

    inline ulong bitmat_mult_vM(const ulong *M, ulong n, ulong v)
    {
        ulong p = 0;
        for (ulong j=0; j<n; ++j)
        {
            if ( v&1 )  p ^= M[j];
            v >>= 1;
        }
        return p;
    }

So we modify two lines in the loop:

    ulong v = bitmat_mult_vM(M, n, a);  // a*M
    v = parity( v & b );  // a*M*b (dot product)


The algorithm for multiplication with normal bases is much more attractive for hardware implementations
than for software, see [120]. An alternative test for normality is given in section 42.6.4 on page 908.

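The shifted bilinear form can be tested against polynomial multiplication in a small field. In the sketch below the multiplication matrix is not obtained via the matrices A, C^T and D of the algorithm above, but directly from its meaning: entry (i, j) is the first normal basis coefficient of the product x^(2^i) * x^(2^j). This shortcut, the exhaustive back-conversion, and all names are assumptions of this example; the modulus is the normal polynomial x^5 + x^4 + x^3 + x^2 + 1.

```cpp
#include <cassert>
typedef unsigned long long u64;

static const u64 C = 0x3D;     // x^5 + x^4 + x^3 + x^2 + 1, a normal polynomial
static const unsigned N = 5;

static u64 gf_mul(u64 a, u64 b)  // product in polynomial representation
{
    u64 p = 0;
    while ( b )
    {
        if ( b & 1 )  p ^= a;
        b >>= 1;
        a <<= 1;
        if ( (a >> N) & 1 )  a ^= C;
    }
    return p;
}

static unsigned parity(u64 v)
{
    unsigned p = 0;
    while ( v )  { p ^= 1;  v &= v - 1; }
    return p;
}

static u64 rotr(u64 a)  // cyclic shift: square root in normal representation
{
    return ( (a >> 1) | (a << (N-1)) ) & ((1ULL << N) - 1);
}

static u64 to_poly(u64 rep)  // bit j of rep is the coefficient of x^(2^j)
{
    u64 e = 2, p = 0;
    for (unsigned j=0; j<N; ++j)
    {
        if ( (rep >> j) & 1 )  p ^= e;
        e = gf_mul(e, e);
    }
    return p;
}

static u64 to_normal(u64 p)  // inverse of to_poly() by exhaustive search
{
    for (u64 rep=0; rep<(1ULL << N); ++rep)
        if ( to_poly(rep) == p )  return rep;
    return 0;  // not reached when C is normal
}

// Bit j of M[i] is the first coefficient of x^(2^i) * x^(2^j).
static void mult_matrix(u64 *M)
{
    for (unsigned i=0; i<N; ++i)
    {
        M[i] = 0;
        for (unsigned j=0; j<N; ++j)
        {
            const u64 prod = gf_mul( to_poly(1ULL << i), to_poly(1ULL << j) );
            M[i] |= (to_normal(prod) & 1) << j;
        }
    }
}

static u64 normal_mult(u64 a, u64 b, const u64 *M)
{
    u64 p = 0;
    for (unsigned k=0; k<N; ++k)
    {
        u64 v = 0;  // v = M*b
        for (unsigned i=0; i<N; ++i)  v |= (u64)parity(M[i] & b) << i;
        p |= (u64)parity(v & a) << k;  // component k of the product
        a = rotr(a);
        b = rotr(b);
    }
    return p;
}

// Compare with multiplication in polynomial representation for all pairs.
static bool check_all()
{
    u64 M[N];
    mult_matrix(M);
    for (u64 a=0; a<(1ULL << N); ++a)
        for (u64 b=0; b<(1ULL << N); ++b)
            if ( to_poly(normal_mult(a, b, M)) != gf_mul(to_poly(a), to_poly(b)) )
                return false;
    return true;
}
```

The exhaustive conversion is acceptable only for tiny fields; it merely keeps the example short.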

  Normal poly:  c=1111.1 == x^5 + x^4 + x^3 + x^2 + 1
    k           f=g**k   Tr(f)   x^2+x==f
    0 = .....:   11111     1
    1 = ....1:   ....1     1
    2 = ...1.:   ...1.     1
    3 = ...11:   .111.     1
    4 = ..1..:   ..1..     1
    5 = ..1.1:   1.111     0     x=.11.1
    6 = ..11.:   111..     1
    7 = ..111:   .11.1     1
    8 = .1...:   .1...     1
    9 = .1..1:   111.1     0     x=.1.11
   10 = .1.1.:   .1111     0     x=..1.1
   11 = .1.11:   ..11.     0     x=...1.
   12 = .11..:   11..1     1
   13 = .11.1:   11...     0     x=.1...
   14 = .111.:   11.1.     1
   15 = .1111:   1.1..     0     x=.11..
   16 = 1....:   1....     1
   17 = 1...1:   ..111     1
   18 = 1..1.:   11.11     0     x=.1..1
   19 = 1..11:   1.11.     1
   20 = 1.1..:   1111.     0     x=.1.1.
   21 = 1.1.1:   ...11     0     x=....1
   22 = 1.11.:   .11..     0     x=..1..
   23 = 1.111:   .1.1.     0     x=..11.
   24 = 11...:   1..11     1
   25 = 11..1:   .1.11     1
   26 = 11.1.:   1...1     0     x=.1111
   27 = 11.11:   ..1.1     0     x=...11
   28 = 111..:   1.1.1     1
   29 = 111.1:   1..1.     0     x=.111.
   30 = 1111.:   .1..1     0     x=..111

Figure 42.6-B: Solving the reduced quadratic equation x^2 + x = f for powers f = g^k of the generator
g = x. The equation is solvable if the trace is zero, that is, the number of ones in the normal representation
is even. The (primitive) field polynomial is 1 + x^2 + x^3 + x^4 + x^5.


42.6.2 Solving the reduced quadratic equation 


The reduced quadratic equation x^2 + x = f has two solutions if Tr(f) = 0. One solution x =
[x_0, x_1, ..., x_{n-1}] can be computed as x_k = \sum_{j=0}^{k} f_j where f = [f_0, f_1, ..., f_{n-1}]. This follows from

  x^2 + x = [x_0 + x_{n-1}, x_0 + x_1, x_1 + x_2, ..., x_{n-2} + x_{n-1}]    (42.6-6)

Now equate x^2 + x = f and set x_{n-1} = 0 (setting x_{n-1} = 1 gives the complement which is also a solution).
In C++ this translates to (see section 1.13.5 on page 32) [FXT: bpol/normal-solvequadratic.h]:

    inline ulong normal_solve_reduced_quadratic(ulong c)
    // Solve x^2 + x = c
    // Must have: trace(c)==0, i.e. parity(c)==0
    // Return one solution x, the other solution equals 1+x,
    // that is, the complement of x.
    {
        return inverse_rev_gray_code(c);
    }

The highest bit of the result is zero if and only if the equation is solvable: Tr(c) = 0, that is, the vector
c has an even number of ones. The reversed Gray code is given in section 1.16.6 on page 45.


A function to compute the trace and solve the reduced quadratic (if possible) is

    inline ulong normal_solve_reduced_quadratic_q(ulong c, ulong &x)
    // Return t, the trace of c.
    // If t==0 then x^2 + x = c is solvable
    // and a solution is written to x.
    {
        x = inverse_gray_code(c);
        ulong t = (x & 1);
        x >>= 1;  // immaterial if t==1, but avoid branch
        return t;
    }

The program [FXT: gf2n/normalbasis-demo.cc] prints the powers g^k of a generator g in normal basis
representation and solves x^2 + x = g^k when possible, see figure 42.6-B. By default a primitive normal
polynomial from [FXT: bpol/normal-primpoly.cc] is used.

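The rule x_k = f_0 + f_1 + ... + f_k needs no field arithmetic at all, so it can be checked on bare n-bit words. The sketch below is an illustration with hypothetical names; it assumes that bit k of a word holds the k-th normal basis coefficient, so that squaring is a rotation towards the higher bits, and it computes the prefix sums with a simple loop rather than with the Gray code routines.

```cpp
#include <cassert>
typedef unsigned long long u64;

static u64 rotl(u64 x, unsigned n)  // squaring in normal basis representation
{
    return ( (x << 1) | (x >> (n-1)) ) & ((1ULL << n) - 1);
}

static unsigned parity(u64 v)
{
    unsigned p = 0;
    while ( v )  { p ^= 1;  v &= v - 1; }
    return p;
}

// Bit k of the result is f[0]+f[1]+...+f[k] modulo 2.
static u64 prefix_xor(u64 f, unsigned n)
{
    u64 x = 0;
    unsigned s = 0;
    for (unsigned k=0; k<n; ++k)
    {
        s ^= (unsigned)((f >> k) & 1);
        x |= (u64)s << k;
    }
    return x;
}

// Count the words f of even parity for which prefix_xor(f) solves x^2 + x = f.
static unsigned count_solved(unsigned n)
{
    unsigned ct = 0;
    for (u64 f=0; f<(1ULL << n); ++f)
    {
        if ( parity(f) )  continue;  // trace one: no solution
        const u64 x = prefix_xor(f, n);
        if ( (rotl(x, n) ^ x) == f )  ++ct;
    }
    return ct;
}
```

For f = 1.111 the function prefix_xor gives .11.1, the solution shown for k = 5 in figure 42.6-B.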

42.6.3 The number of binary normal bases ‡


   n:  A_n      n:    A_n      n:       A_n      n:          A_n
   1:    1     11:     93     21:     27783     31:     28629151
   2:    1     12:    128     22:     95232     32:     67108864
   3:    1     13:    315     23:    182183     33:     97327197
   4:    2     14:    448     24:    262144     34:    250675200
   5:    3     15:    675     25:    629145     35:    352149525
   6:    4     16:   2048     26:   1290240     36:    704643072
   7:    7     17:   3825     27:   1835001     37:   1857283155
   8:   16     18:   5376     28:   3670016     38:   3616800768
   9:   21     19:  13797     29:   9256395     39:   5282242875
  10:   48     20:  24576     30:  11059200     40:  12884901888

Figure 42.6-C: The number A_n of degree-n binary normal polynomials up to n = 40.

   n:  B_n      n:    B_n      n:       B_n      n:          B_n
   1:    1     11:     87     21:     23579     31:     28629151
   2:    1     12:     52     22:     59986     32:     33552327
   3:    1     13:    315     23:    178259     33:     78899078
   4:    1     14:    291     24:    103680
   5:    3     15:    562     25:    607522
   6:    3     16:   1017     26:    859849
   7:    7     17:   3825     27:   1551227
   8:    7     18:   2870     28:   1815045
   9:   19     19:  13797     29:   9203747
  10:   29     20:  11255     30:   5505966

Figure 42.6-D: The number B_n of degree-n binary primitive normal polynomials up to n = 33.

The number A_n of degree-n binary normal polynomials up to n = 40 is given in figure 42.6-C. A table of
the values A_n for 1 ≤ n ≤ 130 and their factorizations is given in [FXT: data/num-normalpoly.txt]. The
sequence A_n is entry A027362 in [312]. The number B_n of degree-n binary primitive normal polynomials
up to n = 33 is given in figure 42.6-D. This is sequence A107222 in [312].
42.6.3.1 Computation via exhaustive search 


For small degrees all normal polynomials can be generated by selecting from the irreducible polynomials
those that are normal. Using the mechanism that generates all irreducible polynomials via Lyndon words,
which is described in section the computation is a matter of minutes for n < 25. The
program [FXT: prints all normal polynomials of a given degree n, its output
for n = 9 is shown in figure 42.6-E. We can compute the number of normal (A_n) and primitive normal
(B_n) binary polynomials for small degrees n using that program. The table of the values B_n in figure
42.6-D was produced with the mentioned program; the computation up to n = 30 takes about 90 minutes.
As noted in [165], no formula for the number of primitive normal polynomials is presently known. The
proof that primitive normal bases exist for all finite fields is given in [231].


42.6.3.2 Cycles in the De Bruijn graph 
Quite surprisingly, it turns out that A_n equals the number of cycles in the De Bruijn graph (see section 41.5
on page 873 and section 20.2.2 on page 395). Therefore for n a power of 2 the number A_n equals the
   1:  c = 11..11...1  P  == x^9 + x^8 + x^5 + x^4 + 1
   2:  c = 11...1..11  P  == x^9 + x^8 + x^4 + x + 1
   3:  c = 11111...11  P  == x^9 + x^8 + x^7 + x^6 + x^5 + x + 1
   4:  c = 11.11.1.11  P  == x^9 + x^8 + x^6 + x^5 + x^3 + x + 1
   5:  c = 111....1.1  P  == x^9 + x^8 + x^7 + x^2 + 1
   6:  c = 111...1111  P  == x^9 + x^8 + x^7 + x^3 + x^2 + x + 1
   7:  c = 11.......1     == x^9 + x^8 + 1
   8:  c = 11.111..11  P  == x^9 + x^8 + x^6 + x^5 + x^4 + x + 1
   9:  c = 1111.1.1.1  P  == x^9 + x^8 + x^7 + x^6 + x^4 + x^2 + 1
  10:  c = 1111..1.11  P  == x^9 + x^8 + x^7 + x^6 + x^3 + x + 1
  11:  c = 11.1.11.11  P  == x^9 + x^8 + x^6 + x^4 + x^3 + x + 1
  12:  c = 11.1..1..1     == x^9 + x^8 + x^6 + x^3 + 1
  13:  c = 1111111.11  P  == x^9 + x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x + 1
  14:  c = 1111...111  P  == x^9 + x^8 + x^7 + x^6 + x^2 + x + 1
  15:  c = 11...1.1.1  P  == x^9 + x^8 + x^4 + x^2 + 1
  16:  c = 11.1..1111  P  == x^9 + x^8 + x^6 + x^3 + x^2 + x + 1
  17:  c = 11...11111  P  == x^9 + x^8 + x^4 + x^3 + x^2 + x + 1
  18:  c = 111.111..1  P  == x^9 + x^8 + x^7 + x^5 + x^4 + x^3 + 1
  19:  c = 1111.11..1  P  == x^9 + x^8 + x^7 + x^6 + x^4 + x^3 + 1
  20:  c = 11..111.11  P  == x^9 + x^8 + x^5 + x^4 + x^3 + x + 1
  21:  c = 11.11.11.1  P  == x^9 + x^8 + x^6 + x^5 + x^3 + x^2 + 1

Figure 42.6-E: All normal binary polynomials of degree 9. Primitive polynomials are marked with 'P'.


number of binary De Bruijn sequences of length 2n. No isomorphism between both objects (paths and 
binary normal polynomials) is presently known. 


42.6.3.3 Invertible circulant matrices 


  L=1......       L=1.11...  [S]  L=11111..
  M =             M =             M =
  1......         1.11...         11111..
  .1.....         .1.11..         .11111.
  ..1....         ..1.11.         ..11111
  ...1...         ...1.11         1..1111
  ....1..         1...1.1         11..111
  .....1.         11...1.         111..11
  ......1         .11...1         1111..1

  L=111....       L=11..1..       L=1111.1.
  M =             M =             M =
  111....         11..1..         1111.1.
  .111...         .11..1.         .1111.1
  ..111..         ..11..1         1.1111.
  ...111.         1..11..         .1.1111
  ....111         .1..11.         1.1.111
  1....11         ..1..11         11.1.11
  11....1         1..1..1         111.1.1

  L=11.1...  [S]  L=1.1.1..       L=111.11.
  M =             M =             M =
  11.1...         1.1.1..         111.11.
  .11.1..         .1.1.1.         .111.11
  ..11.1.         ..1.1.1         1.111.1
  ...11.1         1..1.1.         11.111.
  1...11.         .1..1.1         .11.111
  .1...11         1.1..1.         1.11.11
  1.1...1         .1.1..1         11.11.1

  n=7  #invertible=7  #singular=2

Figure 42.6-F: The length-7 Lyndon words of odd weight and the corresponding circulant matrices.
Singular matrices are marked with '[S]'. Dots denote zeros.


The number A_n of binary normal bases also equals the number of invertible circulant n x n matrices over
GF(2). This is demonstrated with [FXT: gf2n/bitmat-circulant-demo.cc], the output for n = 7 is shown
in figure 42.6-F. The search uses only Lyndon words, as periodic words would trivially lead to singular
matrices. Further, Lyndon words with an even number of ones can be skipped as the vector [1, 1, 1, ..., 1]
is in the nullspace of the corresponding matrices.

If the set {α, α^2, α^4, α^8, ..., α^{2^{n-1}}} is a normal basis of GF(2^n), we say that α generates the normal basis.
Consider the first row of a circulant matrix as some element β in a normal basis representation. Then
the following rows are β^2, β^4, β^8, ..., β^{2^{n-1}} and the matrix is invertible if β generates a normal basis. If
α generates a normal basis, then an element β = \sum_{i=0}^{n-1} a_i α^{2^i} generates a normal basis if and only if the
polynomial \sum_{i=0}^{n-1} a_i x^i is relatively prime to x^n - 1. Thus, with a fast algorithm to generate Lyndon
words, determine all elements that generate normal bases if one such element is known as follows: select
the Lyndon words with an odd number of ones and test whether gcd(L(x), x^n - 1) = 1 where L(x) is the
binary polynomial corresponding to the Lyndon word. If n is a power of 2, then x^n - 1 = (x - 1)^n and
all Lyndon words with an odd number of ones are coprime to x^n - 1.


  L(x) = 111.11.   W(x) = 1....11        L(x) = 1.1.1..   W(x) = 1..1111
  M =          M^-1 =                    M =          M^-1 =
  111.11.      1....11                   1.1.1..      1..1111
  .111.11      11....1                   .1.1.1.      11..111
  1.111.1      111....                   ..1.1.1      111..11
  11.111.      .111...                   1..1.1.      1111..1
  .11.111      ..111..                   .1..1.1      11111..
  1.11.11      ...111.                   1.1..1.      .11111.
  11.11.1      ....111                   .1.1..1      ..11111

Figure 42.6-G: The inverse of an n x n circulant matrix can be found by computing the inverse W(x)
of its first row as a polynomial L(x) modulo x^n - 1.


If the Lyndon word under consideration is taken as a polynomial L(x) over GF(2), then the corresponding
matrix is invertible if and only if gcd(L(x), x^n - 1) = 1. The first row of the inverse of a circulant
matrix over GF(2) can be found by computing W(x) = L(x)^{-1} mod (x^n - 1) where L(x) is the binary
polynomial with coefficients one where the Lyndon word has a one. As the inverse of a circulant matrix
is also circulant, the remaining rows are cyclic shifts of W(x). Two examples with n = 7 are shown in
figure 42.6-G.

The equality of the number of invertible circulants and normal bases can also be seen as follows: choose
a normal basis and test for each element f whether the elements f^1, f^2, f^4, f^8, ..., f^{2^{n-1}} are linearly
independent. As squaring is a cyclic shift, the matrices to be tested are the circulants we considered.

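The gcd criterion is easy to try out for small n. The following sketch uses hypothetical names and, unlike the search described above, runs through all nonzero n-bit words instead of only the Lyndon words. Each invertible circulant is then counted once for every cyclic shift of its first row, so the count is n times the number obtained from Lyndon words, that is, n * A_n.

```cpp
#include <cassert>
typedef unsigned long long u64;

static int degree(u64 p)  // degree of a binary polynomial, -1 for zero
{
    int d = -1;
    while ( p )  { ++d;  p >>= 1; }
    return d;
}

// Greatest common divisor of two binary polynomials.
static u64 gcd2(u64 a, u64 b)
{
    while ( b )
    {
        const int db = degree(b);
        for (int da = degree(a); da >= db; da = degree(a))
            a ^= b << (da - db);  // reduce a modulo b
        const u64 t = a;  a = b;  b = t;
    }
    return a;
}

// Number of first rows L (n-bit words) whose circulant matrix is invertible.
static u64 count_invertible_circulants(unsigned n)
{
    const u64 xn1 = (1ULL << n) | 1;  // x^n - 1 over GF(2)
    u64 ct = 0;
    for (u64 L=1; L<(1ULL << n); ++L)
        if ( gcd2(L, xn1) == 1 )  ++ct;
    return ct;
}
```

For n = 7 the count is 49 = 7 * 7, matching the 7 invertible matrices of figure 42.6-F, and for n = 8 all 128 words of odd weight are counted.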

42.6.3.4 Factorization of x^n - 1

The factorization of the polynomial x^n - 1 over GF(2) can be used for the computation of A_n. The file
[FXT: data/polfactdeg.txt] supplies the necessary information:

  # Structure of the factorization of x^n-1 over GF(2):
   1: [1] [1*1]
   2: [2] [1*1]
   3: [1] [1*1 + 1*2]
   4: [4] [1*1]
   5: [1] [1*1 + 1*4]
   6: [2] [1*1 + 1*2]
   7: [1] [1*1 + 2*3]
   8: [8] [1*1]
   9: [1] [1*1 + 1*2 + 1*6]
  10: [2] [1*1 + 1*4]
  11: [1] [1*1 + 1*10]
  12: [4] [1*1 + 1*2]
  13: [1] [1*1 + 1*12]
  14: [2] [1*1 + 2*3]
  15: [1] [1*1 + 1*2 + 3*4]
  16: [16] [1*1]
  17: [1] [1*1 + 2*8]

An entry n: [e] [m1*d1 + m2*d2 + ... ] says that x^n - 1 = P(x)^e and P(x) factors into m1
different irreducible polynomials of degree d1, m2 different irreducible polynomials of degree d2 and so
on. As an example, for n = 6 we have

  x^6 - 1 = [x^3 - 1]^2 = [(x + 1) (x^2 + x + 1)]^2    (42.6-7)

The polynomial x^6 - 1 is the square (e = 2) of a product of one irreducible polynomial of degree 1 and
one of degree 2. Therefore we have the entry: 6: [2] [1*1 + 1*2]. Another example, n = 15,

  x^15 - 1 = [(x + 1) (x^2 + x + 1) (x^4 + x + 1) (x^4 + x^3 + 1) (x^4 + x^3 + x^2 + x + 1)]^1    (42.6-8)

corresponding to the entry 15: [1] [1*1 + 1*2 + 3*4].


For the number of normal polynomials we have

  A_n = (2^n / n) \prod_j (1 - 1/2^{d_j})^{m_j}    (42.6-9)

Note that the quantity e does not appear in the formula. For example, with n = 6 and n = 15 we find

  A_6 = (2^6 / 6) (1 - 1/2^1)^1 (1 - 1/2^2)^1 = (64/6) (1/2) (3/4) = 4    (42.6-10a)
  A_15 = (2^15 / 15) (1 - 1/2^1)^1 (1 - 1/2^2)^1 (1 - 1/2^4)^3 = 675    (42.6-10b)


42.6.3.5 Efficient computation 


It is possible to compute the number of degree-n normal binary polynomials without explicitly factorizing
the polynomial x^n - 1. We have

  x^n - 1 = \prod_{d \mid n} Y_d(x)    (42.6-11)

where Y_d(x) is the d-th cyclotomic polynomial (see section 40.11 on page 857). We further know that
Y_d(x) factors into φ(d)/r polynomials of degree r where r = ord_d(2) is the order of 2 modulo d. Let
a_n := A_n / (2^n/n), then a_n can for odd n be computed as

  a_n = \prod_{d \mid n} (1 - 1/2^r)^{φ(d)/r}    (42.6-12)

The following GP code works for all odd n:

    p=2;  /* global */
    num_normal_p(n)=
    {
        local( r, i, pp );
        pp = 1;
        fordiv (n, d,
            r = znorder(Mod(p,d));
            i = eulerphi(d)/r;
            pp *= (1 - 1/p^r)^i;
        );
        return( pp );
    }


The number A_n can be computed (for arbitrary n) as A_n = a_q (2^n/n) where q is odd and n = q 2^t:

    num_normal(n)=
    {
        local( t, q, pp );
        t=0;  q=n;
        while ( 0==(q%p),  q/=p;  t+=1; );
        /* here: n==q*p^t */
        pp = num_normal_p(q);
        pp *= p^n/n;
        return( pp );
    }

The quantity t is not used in the computation. The implementation is quite efficient: the computation of
A_n for all n ≤ 10,000 takes less than three seconds. The computation of A_n for n = 1234567 = 127 · 9721
(A_n is a number with 371,636 decimal digits) takes about 200 milliseconds.
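
For degrees where the result fits into a machine word, the same computation can be done with integers only. The sketch below is not a translation of the GP routines but a variant under the following assumption: for the odd part q of n the divisors d satisfy \sum φ(d) = q, so 2^q a_q equals the integer \prod (2^r - 1)^{φ(d)/r} and A_n = 2^{n-q} · \prod (2^r - 1)^{φ(d)/r} / n. The function names are made up for this example, which is limited to n < 64.

```cpp
#include <cassert>
typedef unsigned long long u64;

static u64 euler_phi(u64 d)
{
    u64 r = d;
    for (u64 p=2; p*p<=d; ++p)
    {
        if ( d % p == 0 )
        {
            while ( d % p == 0 )  d /= p;
            r -= r / p;
        }
    }
    if ( d > 1 )  r -= r / d;
    return r;
}

// Multiplicative order of 2 modulo an odd number d (1 for d == 1).
static u64 order_of_2(u64 d)
{
    if ( d == 1 )  return 1;
    u64 r = 1, v = 2 % d;
    while ( v != 1 )  { v = (2 * v) % d;  ++r; }
    return r;
}

static u64 ipow(u64 b, u64 e)
{
    u64 r = 1;
    while ( e-- )  r *= b;
    return r;
}

// Number of binary normal polynomials of degree n, for 1 <= n < 64.
static u64 num_normal(unsigned n)
{
    u64 q = n;
    while ( q % 2 == 0 )  q /= 2;  // odd part of n
    u64 prod = 1;  // becomes 2^q * a_q
    for (u64 d=1; d<=q; ++d)
    {
        if ( q % d )  continue;  // d must divide q
        const u64 r = order_of_2(d);
        prod *= ipow( (1ULL << r) - 1, euler_phi(d) / r );
    }
    return ( prod << (n - q) ) / n;
}
```

For n = 15 the product is 3 * 15 * 15^2 = 10125, and 10125 / 15 = 675 as in relation 42.6-10b.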
42.6.4 Dual and self-dual bases

Let A = {a_0, a_1, a_2, ..., a_{n-1}} be a basis of GF(2^n). A basis B = {b_0, b_1, b_2, ..., b_{n-1}} such that

  Tr(a_k b_j) = δ_{k,j}  for  0 ≤ k, j < n    (42.6-13)

is called the dual basis (or complementary basis, or trace-orthonormal basis) of A. A basis that is its own
dual is called self-dual. We consider only normal bases here. If α is a root of a normal polynomial C,
then A = {α, α^2, α^4, ..., α^{2^{n-1}}} is a normal basis.


[Table with one row for each of the 21 normal polynomials of degree 9, giving C, T, C*, and
D = T^-1 (mod x^n-1).]

Figure 42.6-H: All normal polynomials C of degree 9 and their polynomials T (left), their duals C*
and D = T^{-1} (right). Primitive polynomials are marked with 'P', self-dual C = C* are marked with 'S'.


Let C be an irreducible polynomial, α a root of C, and define t_k = Tr(α · α^{2^k}) for 0 ≤ k < n. We can
compute the binary vector [t_0, t_1, ..., t_{n-1}] as follows [FXT: bpol/normalpoly-dual.cc]:

    ulong
    gf2n_xx2k_trace(ulong c, ulong deg)
    // Return vector T of traces T[k]=trace(ek),
    //   where ek = x*x^(2^k), k=0..deg-1, and
    //   x is a root of the irreducible polynomial C.
    // Must have:  deg == degree(C)
    {
        if ( c==3 )  return 1UL;  // x+1 is self-dual

        const ulong tv = gf2n_trace_vector_x(c, deg);  // traces of x^k
        ulong rt = 2UL;  // root of C
        const ulong h = 1UL << (deg-1);  // aux

        ulong v = 0;
        for (ulong k=0; k<deg; ++k)
        {
            ulong ek = bitpolmod_times_x(rt, c, h);  // == x*x^(2^k)
            ulong tk = gf2n_fast_trace(ek, tv);      // == sum(ek[i]*tv[i])
            v |= (tk<<k);
            rt = bitpolmod_square(rt, c, h);
        }

        return v;
    }


Now define the polynomial T as

  T = t_0 + t_1 x + t_2 x^2 + ... + t_{n-1} x^{n-1}    (42.6-14)

The polynomial C is normal if and only if T has an inverse D = T^{-1} mod (x^n - 1). Therefore normality
of C can be tested as follows [FXT: bpol/bitpol-normal.cc]:

    bool
    bitpol_normal2_q(ulong c, ulong n)
    // Return whether polynomial c (of degree n) is normal.
    // Must have: c irreducible.
    {
        const ulong t = gf2n_xx2k_trace(c, n);
        const ulong xn1 = (1UL<<n) | 1UL;  // x^n-1
        return  ( 1 == bitpol_gcd(t, xn1) );
    }

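The criterion can be tried out without FXT by computing the traces t_k from the definition. The sketch below uses hypothetical helper names, needs n >= 2, assumes that the polynomial passed in is irreducible, and is much slower than the routines above because every trace takes n squarings.

```cpp
#include <cassert>
typedef unsigned long long u64;

static u64 gf_mul(u64 a, u64 b, u64 c, unsigned n)  // product modulo c
{
    u64 p = 0;
    while ( b )
    {
        if ( b & 1 )  p ^= a;
        b >>= 1;
        a <<= 1;
        if ( (a >> n) & 1 )  a ^= c;
    }
    return p;
}

static unsigned gf_trace(u64 a, u64 c, unsigned n)
{
    u64 t = 0;
    for (unsigned j=0; j<n; ++j)  { t ^= a;  a = gf_mul(a, a, c, n); }
    return (unsigned)t;
}

static int degree(u64 p)
{
    int d = -1;
    while ( p )  { ++d;  p >>= 1; }
    return d;
}

static u64 gcd2(u64 a, u64 b)  // gcd of binary polynomials
{
    while ( b )
    {
        const int db = degree(b);
        for (int da = degree(a); da >= db; da = degree(a))
            a ^= b << (da - db);
        const u64 t = a;  a = b;  b = t;
    }
    return a;
}

// Bit k of the result is Tr( x * x^(2^k) ), for 0 <= k < n.
static u64 xx2k_trace(u64 c, unsigned n)
{
    u64 T = 0, rt = 2;  // rt runs through x^(2^k)
    for (unsigned k=0; k<n; ++k)
    {
        T |= (u64)gf_trace( gf_mul(2, rt, c, n), c, n ) << k;
        rt = gf_mul(rt, rt, c, n);
    }
    return T;
}

// Whether the irreducible polynomial c of degree n is normal.
static bool is_normal(u64 c, unsigned n)
{
    return 1 == gcd2( xx2k_trace(c, n), (1ULL << n) | 1 );
}
```

For c = x^4 + x^3 + x^2 + x + 1 the vector is T = 1 + x + x^3, which is coprime to x^4 - 1 = (x + 1)^4. For c = x^4 + x + 1 we get T = x + x^3, which is divisible by x + 1, so that polynomial is not normal.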

If D = T then the basis under consideration is self-dual. But as D = T^{-1} we must have T = 1. Therefore
the following statements can be used to test whether the roots of C are a self-dual (normal) basis:

    // C == an irreducible polynomial of degree n
    ulong T = gf2n_xx2k_trace(C, n);
    if ( T==1 )  /* C is normal, its roots are a self-dual (normal) basis */


To compute the dual basis, write

  D = d_0 + d_1 x + d_2 x^2 + ... + d_{n-1} x^{n-1}    (42.6-15a)

Then β defined by

  β = d_0 α + d_1 α^2 + d_2 α^4 + ... + d_{n-1} α^{2^{n-1}}    (42.6-15b)

is a root of a normal polynomial C*, and B = {β, β^2, β^4, ..., β^{2^{n-1}}} is the dual (normal) basis of A.
The following routine computes T, D, and C*:

    ulong
    gf2n_dual_normal(ulong c, ulong deg, ulong ntc/*=0*/, ulong *ntd/*=0*/)
    // Return the minimal polynomial CS for the dual (normal) basis
    // with the irreducible normal polynomial C.
    // Return zero if C is not normal.
    // Must have: deg == degree(C).
    // If ntc is supplied it must be equal to gf2n_xx2k_trace(c, deg).
    // If ntd is nonzero, ntc^-1 (mod x^deg-1) is written to it.
    {
        if ( 0==ntc )  ntc = gf2n_xx2k_trace(c, deg);
        const ulong d = bitpolmod_inverse(ntc, 1 | (1UL<<deg) );  // ntc=d^-1 (mod x^deg-1)
        if ( 0==d )  return 0;  // C not normal
        if ( 0!=ntd )  *ntd = d;

        const ulong h = 1UL << (deg-1);  // aux
        ulong alpha = 2UL;  // 'x', a root of C
        ulong beta = 0;     // root of the dual polynomial
        for (ulong m=d; m!=0; m>>=1)
        {
            if ( m&1 )  beta ^= alpha;
            alpha = bitpolmod_square(alpha, c, h);
        }

        ulong cs;  // minimal polynomial of beta
        bitpolmod_minpoly(beta, c, deg, cs);
        return cs;
    }


Figure 42.6-H shows the normal polynomials C of degree 9 and the polynomials C*. It was created with
the program [FXT: gf2n/normalpoly-dual-demo.cc]. A list of all polynomials with self-dual bases is given
in section 40.9.13.


42.6.5 The number of self-dual normal bases


Figure 42.6-I gives the number S_n of self-dual normal bases (top) and the number Z_n of such bases
where the field polynomial is primitive (bottom). The field polynomial is the minimal polynomial of
any of the basis elements. The sequences of values S_n and Z_n are entries in [312].
No formula for the numbers Z_n is known; the values were computed with a program from the FXT
library. The following routine for computing the values S_n (for p = 2) is given by
Max Alekseyev [priv. comm.]:
  n: Sn      n: Sn      n: Sn      n: Sn       n: Sn
  1: 1       9: 3      17: 17     25: 205     33: 3267
  2: 1      10: 4      18: 48     26: 320     34: 4352
  3: 1      11: 3      19: 27     27: 513     35: 4095
  4: 0      12: 0      20: 0      28: 0       36: 0
  5: 1      13: 5      21: 63     29: 565     37: 7085
  6: 2      14: 8      22: 96     30: 1920    38: 13824
  7: 1      15: 15     23: 89     31: 961     39: 20475
  8: 0      16: 0      24: 0      32: 0       40: 0

  n: Zn      n: Zn      n: Zn      n: Zn       n: Zn
  1: 0       9: 2      17: 17     25: 200     33: 2660
  2: 1      10: 3      18: 25     26: 215     34: 2917
  3: 1      11: 3      19: 27     27: 428     35:
  4: 0      12: 0      20: 0      28: 0       36:
  5: 1      13: 5      21: 57     29: 562     37:
  6: 1      14: 4      22: 60     30: 997     38:
  7: 1      15: 11     23: 87     31: 961     39:
  8: 0      16: 0      24: 0      32: 0       40:

Figure 42.6-I: Number of self-dual normal bases (S_n, top) and self-dual normal bases where the field
polynomial is primitive (Z_n, bottom). No self-dual normal basis exists for n a multiple of 4.


sdn(m,p)=
{ /* Number of distinct self-dual normal bases of GF(p^m) over GF(p) where p is prime */
    local(F, f, g, s, c, d);
    if ( p==2 && m%4==0, return(0) );

    if ( !(m%p),  /* p divides m */
        s = m\p;
        return( p^((p-1)*(s+(s*(p+1))%2)/2-1) * sdn(s,p) );
    ,  /* else */
        F = factormod( (x^m - 1)/(x - 1), p );
        c = d = [];
        for (i=1, matsize(F)[1],
            f = lift(F[i,1]);
            g = polrecip(f);
            if ( f==g, c = concat( c, vector(F[i,2],j,poldegree(f)/2) ); );
            if ( lex(Vec(f), Vec(g))==1 ,
                d = concat( d, vector(F[i,2],j,poldegree(f)) );
            );
        );
        return( 2^(p%2) * prod(i=1,#c, p^c[i] + 1) * prod(j=1,#d, p^d[j] - 1) / m );
    );
}


We note that duality is defined for any basis, but no self-dual polynomial basis exists. See [232]
for more information. Algorithms for the construction of self-dual normal bases are given in [229].


42.7 Conversion between normal and polynomial representation 


If the field polynomial C is normal, then conversion between the representations in polynomial and normal
basis can be done as follows: Let Z be the n x n matrix whose k-th column equals x^(2^k) mod C where n
is the degree of C. If a is the polynomial representation, then the normal representation is b = Z^{-1} · a.
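The construction of Z can be sketched in C++ as follows. This is a minimal illustration and not the FXT implementation: the names `gf2_mulmod`, `z_columns`, `z_invertible` and `normal_to_poly` are hypothetical, polynomials over GF(2) are stored as bit masks (bit i is the coefficient of x^i), and the degree is assumed to satisfy 2 <= n <= 32. Invertibility of Z (and hence normality of an irreducible C) is checked by Gaussian elimination on the columns.

```cpp
#include <cassert>
#include <cstdint>

// Multiply a*b modulo c over GF(2); c has degree n, a and b have degree < n.
std::uint64_t gf2_mulmod(std::uint64_t a, std::uint64_t b, std::uint64_t c, unsigned n)
{
    const std::uint64_t h = std::uint64_t(1) << n;
    std::uint64_t r = 0;
    while ( b != 0 )
    {
        if ( b & 1 )  r ^= a;
        b >>= 1;
        a <<= 1;
        if ( a & h )  a ^= c;  // reduce as soon as degree n is reached
    }
    return r;
}

// Set col[k] = x^(2^k) mod c for k = 0..n-1 (the columns of Z).
void z_columns(std::uint64_t c, unsigned n, std::uint64_t *col)
{
    assert( n >= 2 && n <= 32 );
    std::uint64_t s = 2;  // the element 'x'
    for (unsigned k = 0; k < n; ++k)
    {
        col[k] = s;
        s = gf2_mulmod(s, s, c, n);
    }
}

// Return whether Z is invertible (c is assumed to be irreducible of degree n).
bool z_invertible(std::uint64_t c, unsigned n)
{
    std::uint64_t col[32], basis[32] = {0};
    z_columns(c, n, col);
    for (unsigned k = 0; k < n; ++k)
    {
        std::uint64_t v = col[k];
        bool inserted = false;
        for (int b = int(n) - 1; b >= 0; --b)
        {
            if ( ((v >> b) & 1) == 0 )  continue;
            if ( basis[b] == 0 )  { basis[b] = v;  inserted = true;  break; }
            v ^= basis[b];  // clears bit b
        }
        if ( !inserted )  return false;  // column depends on earlier columns
    }
    return true;
}

// Convert normal representation b to polynomial representation a = Z*b.
std::uint64_t normal_to_poly(std::uint64_t b, const std::uint64_t *col, unsigned n)
{
    std::uint64_t a = 0;
    for (unsigned k = 0; k < n; ++k)  if ( (b >> k) & 1 )  a ^= col[k];
    return a;
}
```

For C = x^3 + x^2 + 1 the columns are x, x^2, and x^2 + x + 1, so the all-ones normal representation maps to the polynomial 1. The opposite direction needs the inverse matrix, which is omitted in this sketch.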


The implementation [FXT: class GF2n in bpol/gf2n.h] allows the conversion to the normal representation
if the field polynomial is normal. In the initializer the matrices Z (n2p_tab[]) and Z^{-1} (p2n_tab[]) are
computed with the lines
[Figure 42.7-A, listing not reproduced: for k = 0, 1, ..., 30 it shows bin(k), the element f = g^k in
polynomial representation, its normal representation, and trace(f), together with the matrices P2N and N2P.]

Figure 42.7-A: Conversion between normal and polynomial representation with the (primitive) field
polynomial c of degree 5. The conversion matrices are given as P2N = Z^{-1} and N2P = Z.


// conversion to and from normal representation:
for (ulong k=0,s=2; k<n_; ++k)
{
    n2p_tab[k] = s;
    s = bitpolmod_square(s, c_, h_);
}
bitmat_transpose(n2p_tab, n_, n2p_tab);
is_normal_ = bitmat_inverse(n2p_tab, n_, p2n_tab);


The last line records whether the field polynomial is normal (Z is invertible). 


The functions [FXT: bpol/gf2n.cc]

ulong // static
GF2n::p2n(ulong f)
{ return bitmat_mult_Mv(p2n_tab, n_, f); }

ulong // static
GF2n::n2p(ulong f)
{ return bitmat_mult_Mv(n2p_tab, n_, f); }

allow conversions between the normal and polynomial representations. To get the normal representation
of a given element, use the method

ulong get_normal() const { return p2n(v_); }

This is demonstrated in [FXT: gf2n/gf2n-normal-demo.cc] where both the polynomial and the normal
representation are given, see figure 42.7-A.


If the last argument of the initialization routine of the C++ class GF2n, init(n, c, normalq), is set, then
a (primitive) normal polynomial will be used as field polynomial. A list of primitive normal polynomials
is given in [FXT: bpol/normal-primpoly.cc].


42.8 Optimal normal bases (ONB)


The number of nonzero terms in the multiplication matrix determines the complexity (operation count)
for the multiplication with normal bases. It turns out that for certain values of n there are normal bases
of GF(2^n) whose multiplication matrices have at most two nonzero entries in each row (and column).
Such bases are called optimal normal bases (ONB).


Optimal normal bases are especially interesting for hardware implementations because of both the highly 
regular structure of the multiplication algorithm and the minimal complexity with ONBs. 


42.8.1 Type-1 optimal normal bases 


A type-1 optimal normal basis exists for n when p := n + 1 is prime and 2 is a primitive root modulo p
(and for n = 0 and n = 1). The sequence of such n is (entry A071642 in [312]):


0, 1, 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82, 100, 106, 130, 138, 148,
162, 172, 178, 180, 196, 210, 226, 268, 292, 316, 346, 348, 372, 378, 388,
418, 420, 442, 460, 466, 490, 508, ...


For n > 1 one always has n = 2 or n = 4 modulo 8. A list of the corresponding primes is given in
figure 41.7-B on page 878. The field polynomial corresponding to a type-1 ONB is the all-ones polynomial

    (x^p - 1) / (x - 1)  =  1 + x + x^2 + x^3 + ... + x^n                    (42.8-1)

The order of these polynomials is n + 1 (for n > 1) so they are non-primitive for all n > 3.
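The existence condition for a type-1 ONB is easy to test directly. The following C++ sketch uses trial division and a naive order computation, which is adequate only for small n; the function names `is_prime`, `order_mod` and `has_type1_onb` are hypothetical and chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Primality by trial division (small arguments only).
bool is_prime(std::uint64_t p)
{
    if ( p < 2 )  return false;
    for (std::uint64_t d = 2; d*d <= p; ++d)  if ( p % d == 0 )  return false;
    return true;
}

// Multiplicative order of a modulo the prime p; a must not be divisible by p.
std::uint64_t order_mod(std::uint64_t a, std::uint64_t p)
{
    assert( a % p != 0 );
    std::uint64_t x = a % p, k = 1;
    while ( x != 1 )  { x = (x * a) % p;  ++k; }
    return k;
}

// Return whether a type-1 ONB exists for n:
// p = n+1 must be prime and 2 must be a primitive root modulo p.
bool has_type1_onb(std::uint64_t n)
{
    if ( n <= 1 )  return true;  // the cases n = 0 and n = 1 named in the text
    const std::uint64_t p = n + 1;
    return is_prime(p) && ( order_mod(2, p) == n );
}
```

For example, n = 6 fails because the order of 2 modulo 7 is 3, and n = 16 fails because the order of 2 modulo 17 is 8.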


[Figure 42.8-A, matrices not reproduced: for the normal polynomial c=11111111111 the panels show
A, A^-1, C^T, D = C^T*A^(-1), and the multiplication matrix M.]

Figure 42.8-A: Matrices that occur with the computation of the multiplication matrix for the field
polynomial c = 1 + x + ... + x^10.


The multiplication matrices are sparse: there is one entry in the first row and column, and two entries in
the other rows and columns. That is, the multiplication matrices for GF(2^n) with optimal normal basis
have 2n - 1 nonzero entries. For example, with n = 10 we obtain the matrix shown at the right of
figure 42.8-A. The equivalent data for n = 4 is shown in figure 42.6-A.


42.8.2 Type-2 optimal normal bases

A type-2 optimal normal basis exists for n if p := 2n + 1 is prime and either

  • n = 1 or n = 2 modulo 4 and the order of 2 modulo p equals 2n, or

  • n = 3 modulo 4 and the order of 2 modulo p equals n.

A type-2 basis exists for the following n < 200 (entry A054639 in [312]):

1, 2, 3, 5, 6, 9, 11, 14, 18, 23, 26, 29, 30, 33, 35, 39, 41, 50, 51, 53, 65, 69, 74, 81, 83,
86, 89, 90, 95, 98, 99, 105, 113, 119, 131, 134, 135, 146, 155, 158, 173, 174, 179, 183,
186, 189, 191, 194
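The two conditions for a type-2 ONB translate directly into code. The following C++ sketch is meant for small n only (trial division, naive order computation); the names `is_prime`, `order_mod` and `has_type2_onb` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Primality by trial division (small arguments only).
bool is_prime(std::uint64_t p)
{
    if ( p < 2 )  return false;
    for (std::uint64_t d = 2; d*d <= p; ++d)  if ( p % d == 0 )  return false;
    return true;
}

// Multiplicative order of a modulo the prime p; a must not be divisible by p.
std::uint64_t order_mod(std::uint64_t a, std::uint64_t p)
{
    assert( a % p != 0 );
    std::uint64_t x = a % p, k = 1;
    while ( x != 1 )  { x = (x * a) % p;  ++k; }
    return k;
}

// Return whether a type-2 ONB exists for n (conditions as stated in the text).
bool has_type2_onb(std::uint64_t n)
{
    const std::uint64_t p = 2*n + 1;
    if ( n == 0 || !is_prime(p) )  return false;
    const std::uint64_t r2 = order_mod(2, p);  // order of 2 modulo p
    const std::uint64_t nm4 = n % 4;
    if ( nm4 == 1 || nm4 == 2 )  return r2 == 2*n;
    if ( nm4 == 3 )  return r2 == n;
    return false;  // n divisible by 4
}
```

As a check, n = 15 is rejected although p = 31 is prime, because the order of 2 modulo 31 is 5.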


The corresponding polynomials p_n (see figure 42.8-B) can be computed via the recurrence

    p_0 := 1,    p_1 := x + 1                         (42.8-2a)
    p_k := x p_{k-1} + p_{k-2}                        (42.8-2b)
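Over GF(2) the recurrence needs only a shift and an XOR per step. A minimal C++ sketch, with the polynomial stored as a bit mask (bit i is the coefficient of x^i); the function name `type2_poly` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Return p_k over GF(2) as a bit mask; requires k <= 63.
std::uint64_t type2_poly(unsigned k)
{
    assert( k <= 63 );
    std::uint64_t p0 = 1, p1 = 3;  // p_0 = 1, p_1 = x + 1
    if ( k == 0 )  return p0;
    for (unsigned i = 2; i <= k; ++i)
    {
        const std::uint64_t p2 = (p1 << 1) ^ p0;  // x*p_{i-1} + p_{i-2}
        p0 = p1;
        p1 = p2;
    }
    return p1;
}
```

For example, `type2_poly(6)` gives the bit string 1110011, that is x^6 + x^5 + x^4 + x + 1.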
  0:   -   1             == 1
  1:  [1]  11            == x + 1
  2:  [1]  111           == x^2 + x + 1
  3:  [1]  11.1          == x^3 + x^2 + 1
  4:   -   111.1         == x^4 + x^3 + x^2 + 1
  5:  [1]  11.111        == x^5 + x^4 + x^2 + x + 1
  6:  [1]  111..11       == x^6 + x^5 + x^4 + x + 1
  7:   -   11.1...1      == x^7 + x^6 + x^4 + 1
  8:   -   111.1...1     == x^8 + x^7 + x^6 + x^4 + 1
  [Rows 9 to 32 and the second (right-aligned) column are not reproduced.
   In this range the entry '[1]' appears for k = 9, 11, 14, 18, 23, 26, 29, 30.]

Figure 42.8-B: The polynomials p_k as binary strings, high coefficients aligned left and right. The entry
in the second column is '[1]' if the polynomial is irreducible (a field polynomial for a type-2 ONB).


Compare to the recursion that transforms a linear hybrid cellular automaton into a binary polynomial,
relation 41.8-1 on page 879: the type-2 ONBs correspond to the simplest LHCA defined by the rule
having a single one as the lowest bit of the rule word.


p7 = 1*x^7 + 1*x^6 + 6*x^5 + 5*x^4 + 10*x^3 + 6*x^2 + 4*x + 1


 n\k    0    1    2    3    4    5    6    7    8
  0:    1
  1:    1    1
  2:    1    2    1
  3:    1    3    3   *1
  4:    1    4   *6   *4    1
  5:    1   *5  *10   10    5    1
  6:   *1   *6   15   20   15    6    1
  7:   *1    7   21   35   35   21    7    1
  8:    1    8   28   56   70   56   28    8    1


Figure 42.8-C: Locations (starred entries) of the coefficients of the polynomial p_7 in Pascal's triangle.


Expressions for the polynomials p_n are

    p_n = \sum_{j=0}^{n} \binom{n - \lfloor (j+1)/2 \rfloor}{\lfloor j/2 \rfloor} x^{n-j}
        = \sum_{k=0}^{n} \binom{\lfloor (n+k)/2 \rfloor}{k} x^k                        (42.8-3)


The locations of the coefficients of the polynomials p_n in Pascal's triangle (figure on page 176) lie
on a rising diagonal. For p_7 they are shown in figure 42.8-C. The following relations hold over GF(2)
(but not over Z):

    p_n = \sum_{j=0}^{n} \binom{2n-j}{j} x^{n-j} = \sum_{j=0}^{n} \binom{n+j}{2j} x^j   (42.8-4)


The binomial coefficient \binom{n}{k} modulo 2 equals 1 if the binary expansion of k is a subset of the
expansion of n. With the corresponding bit-wise test (k is a subset of n exactly if bitand(n, k) == k) we
obtain a fast method for the computation of the polynomials p_n (using the first equality in 42.8-4):

t2poly(n) = sum(j=0,n, (bitand(2*n-j, j)==j)*x^(n-j));

The value of binomial coefficients modulo a prime q can be computed via the relation

    \binom{n}{k} = \prod_{j} \binom{n_j}{k_j}   (mod q)                      (42.8-5)

where n = \sum_j n_j q^j and k = \sum_j k_j q^j are the radix-q expansions, see [349, entry "Lucas
Correspondence Theorem"] and also [141]. Moreover, the highest power of q that divides \binom{n}{k}
equals the number of carries when subtracting k from n in base q, see [249]. Especially, if k_j > n_j for
any j, then \binom{n}{k} = 0 mod q. The computation above is obtained by setting q = 2.
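The digit-wise product can be sketched in C++ as follows. The function names are hypothetical; `binomial_mod` assumes that q is a prime with q <= 31 so that the digit binomials fit easily into 64 bits, and `t2poly` is a C++ rendering of the GP one-liner for q = 2, returning p_n as a bit mask (bit i is the coefficient of x^i).

```cpp
#include <cassert>
#include <cstdint>

// Exact binomial coefficient for small arguments (n <= 30).
std::uint64_t small_binomial(std::uint64_t n, std::uint64_t k)
{
    if ( k > n )  return 0;
    std::uint64_t b = 1;
    for (std::uint64_t i = 1; i <= k; ++i)  b = b * (n - k + i) / i;  // exact in every step
    return b;
}

// binomial(n,k) modulo the prime q via the product over radix-q digits.
std::uint64_t binomial_mod(std::uint64_t n, std::uint64_t k, std::uint64_t q)
{
    assert( q >= 2 && q <= 31 );  // q is assumed to be prime
    std::uint64_t r = 1;
    while ( (n != 0 || k != 0) && r != 0 )
    {
        r = ( r * (small_binomial(n % q, k % q) % q) ) % q;
        n /= q;
        k /= q;
    }
    return r;
}

// p_n over GF(2) via the first equality in 42.8-4; requires n <= 63.
std::uint64_t t2poly(unsigned n)
{
    assert( n <= 63 );
    std::uint64_t p = 0;
    for (unsigned j = 0; j <= n; ++j)
        if ( ((2*n - j) & j) == j )  p |= std::uint64_t(1) << (n - j);
    return p;
}
```

For instance, binomial(10,3) = 120 vanishes modulo 3 because the middle radix-3 digit of 3 exceeds that of 10.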


We note a relation that connects the polynomials p_k to the Fibonacci polynomials f_k, defined by

    f_0 := 0,    f_1 := 1                             (42.8-6a)
    f_k := x f_{k-1} + f_{k-2}                        (42.8-6b)

We have

    p_k = f_k + f_{k+1}                               (42.8-7)


As with type-1 ONBs, the multiplication matrices are sparse. The polynomials and multiplication
matrices for n = 6 and n = 9 are

[Two multiplication matrices, not reproduced: for the polynomial with exponents 6,5,4,1,0 (n = 6)
and for the polynomial with exponents 9,8,6,5,4,1,0 (n = 9).]

The intermediate values with the computation of the multiplication matrix for n = 5 are shown in
figure 42.6-A.

The sequence of values n such that an optimal normal basis (either type-1 or type-2) over GF(2^n) exists
is entry A136250 in [312]. The values up to 100 are


1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 14, 18, 23, 26, 28, 29, 30, 33, 35, 36, 39, 41, 50,
51, 52, 53, 58, 60, 65, 66, 69, 74, 81, 82, 83, 86, 89, 90, 95, 98, 99, 100


42.9 Gaussian normal bases 


The type-t Gaussian normal bases (GNB) generalize the optimal normal bases. The type-1 and type-2
GNBs are the corresponding ONBs. The multiplication matrices for type-t GNBs for t > 2 have more
nonzero entries than the ONBs.


A type-t GNB exists for n if p := tn + 1 is prime and gcd(n, tn/r_2) = 1 where r_2 is the order of 2 modulo
p. For n divisible by 8 no GNB exists. Figure 42.9-A shows, for t <= 10, the first values n such that a
type-t GNB exists. The sequences for 1 <= t <= 8 are the following entries in [312]: A071642 (type-1),
A054639 (type-2), A136415 (type-3), A137310 (type-4), A137311 (type-5), A137313 (type-6), A137314
(type-7), and A101284 (type-8). We implement the test in GP:


gauss_test(n, t)=
{ /* test whether a type-t Gaussian normal basis exists for GF(2^n) */
    local( p, r2, g, d );
    p = t*n + 1;
    if ( !isprime(p), return( 0 ) );
    if ( p<=2, return( 0 ) );
    r2 = znorder( Mod(2, p) );
    d = (t*n)/r2;
    g = gcd(d, n);
    return ( if ( 1==g, 1, 0) );
}
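The same test can be written in C++ for small parameters. This is only a sketch under the assumption that trial division and a naive order computation are fast enough; the names `is_prime`, `order_mod`, `gcd64` and `gnb_exists` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Primality by trial division (small arguments only).
bool is_prime(std::uint64_t p)
{
    if ( p < 2 )  return false;
    for (std::uint64_t d = 2; d*d <= p; ++d)  if ( p % d == 0 )  return false;
    return true;
}

// Multiplicative order of a modulo the prime p; a must not be divisible by p.
std::uint64_t order_mod(std::uint64_t a, std::uint64_t p)
{
    assert( a % p != 0 );
    std::uint64_t x = a % p, k = 1;
    while ( x != 1 )  { x = (x * a) % p;  ++k; }
    return k;
}

// Greatest common divisor (Euclidean algorithm).
std::uint64_t gcd64(std::uint64_t a, std::uint64_t b)
{
    while ( b != 0 )  { const std::uint64_t r = a % b;  a = b;  b = r; }
    return a;
}

// Return whether a type-t Gaussian normal basis exists for GF(2^n).
bool gnb_exists(std::uint64_t n, std::uint64_t t)
{
    const std::uint64_t p = t*n + 1;
    if ( p <= 2 || !is_prime(p) )  return false;
    const std::uint64_t r2 = order_mod(2, p);  // order of 2 modulo p
    const std::uint64_t d = (t*n) / r2;
    return gcd64(d, n) == 1;
}
```

For example, `gnb_exists(10, 3)` is false: p = 31 is prime, but the order of 2 is 5, so d = 6 and gcd(6, 10) = 2.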




1: 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82, 100, 106, 130 

2: 1, 2, 3, 5, 6, 9, 11, 14, 18, 23, 26, 29, 30, 33, 35 

3: 4, 6, 12, 14, 20, 22, 46, 52, 54, 60, 70, 76, 92, 94, 116 

4: 1, 3, 7, 9, 13, 15, 25, 37, 43, 45, 49, 67, 73, 79, 87 

5: 2, 12, 20, 26, 36, 42, 84, 92, 98, 108, 114, 132, 140, 164, 188 
6: 1, 2, 3, 5, 6, 7, 10, 11, 13, 17, 23, 26, 27, 30, 33 

7: 4, 28, 30, 54, 60, 70, 78, 94, 100, 108, 118, 126, 166, 196, 214 
8: 5, 9, 11, 17, 29, 35, 39, 51, 65, 71, 77, 95, 101, 107, 117 

9: 2, 4, 18, 20, 34, 42, 44, 58, 60, 68, 82, 84, 92, 98, 124 

10: 1, 6, 7, 10, 13, 18, 19, 21, 27, 31, 42, 43, 46, 49, 54 


Figure 42.9-A: For t in {1, 2, ..., 10}: lowest values n such that a type-t GNB exists for GF(2^n).


42.9.1 Computation of the multiplication matrix 


An algorithm to compute the multiplication matrix for a type-t GNB proceeds as follows (we use a vector
F[1, 2, ..., p - 1]):

1. Set p = tn + 1 (this is a prime) and compute an element r of order t modulo p.
2. For k = 0, 1, ..., t - 1 do the following: set j = r^k and for i = 0, 1, ..., n - 1 set F[j 2^i mod p] = i.
3. Set the multiplication matrix M to zero.
4. For i = 1, 2, ..., p - 2 add one to M_{F[p-i], F[i+1]}.
5. If t is odd, set h = n/2 and do the following: for i = 0, 1, ..., h - 1 increment M_{i, h+i} and M_{h+i, i}.


Implementation in GP:


gauss_nb(n, t)=
{ /* return multiplier matrix for type-t Gaussian normal basis */
  /* returned matrix is over Z and has to be multiplied by Mod(1,2) */
    local(p, r, F, w, x, nh, m, ir, ic);

    p = t*n + 1;
    r = znprimroot(p);
    r = r^(n);  /* r has order t */
    F = vector(p-1);
    w = Mod(1, p);
    for (k=0, t-1,
        j = lift(w);
        for (i=0, n-1,
            F[j] = i;
            j+=j;  if ( j>=p, j-=p );  /* 2*j mod p */
        );
        w *= r;
    );

    m = matrix(n, n);
    for (i=1, p-2,
        ir = F[p-i];  ic = F[i+1];
        m[ ir+1, ic+1 ] += 1;
    );

    if ( 1==(t%2),  /* odd t ==> even n */
        nh = n/2;
        for (i=0, nh-1,
            ir = i;  ic = nh + i;
            ir += 1;  ic += 1;
            m[ir, ic] += 1;
            m[ic, ir] += 1;
        );
    );

    return ( m );
}
[Figure 42.9-B, matrices not reproduced: for n=7, t=4 and for n=12, t=3 the panels show the matrix M
over Z and the matrix M mod 2.]

Figure 42.9-B: Multiplication matrices over Z and GF(2) for Gaussian normal bases with n = 7, t = 4
(left) and n = 12, t = 3 (right). Dots denote zeros.


The implementation computes M with entries in Z, so M has to be reduced modulo 2 before usage.
Figure 42.9-B gives two examples.
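For comparison, the five steps of the algorithm can also be written in C++. The sketch below assumes that a type-t GNB exists for the given n (so p = tn + 1 is prime and every F[j] gets assigned) and that p is small enough for a brute-force search of an element of order t; all names (`Matrix`, `order_mod`, `element_of_order`, `gnb_mult_matrix`, `nonzero_mod2`) are hypothetical. Any element of order t gives the same vector F, because only the subgroup generated by r matters.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Matrix = std::vector< std::vector<unsigned> >;

// Multiplicative order of a modulo the prime p; a must not be divisible by p.
std::uint64_t order_mod(std::uint64_t a, std::uint64_t p)
{
    assert( a % p != 0 );
    std::uint64_t x = a % p, k = 1;
    while ( x != 1 )  { x = (x * a) % p;  ++k; }
    return k;
}

// Return an element of order t modulo the prime p (zero if there is none).
std::uint64_t element_of_order(std::uint64_t t, std::uint64_t p)
{
    for (std::uint64_t g = 1; g < p; ++g)
        if ( order_mod(g, p) == t )  return g;
    return 0;
}

// Multiplication matrix over Z for the type-t GNB; reduce modulo 2 before use.
Matrix gnb_mult_matrix(unsigned n, unsigned t)
{
    const std::uint64_t p = std::uint64_t(t) * n + 1;  // assumed to be prime
    const std::uint64_t r = element_of_order(t, p);
    assert( r != 0 );

    std::vector<unsigned> F(p, 0);  // entries F[1..p-1] are used
    std::uint64_t w = 1;
    for (unsigned k = 0; k < t; ++k)
    {
        std::uint64_t j = w;  // j = r^k mod p
        for (unsigned i = 0; i < n; ++i)  { F[j] = i;  j = (2*j) % p; }
        w = (w * r) % p;
    }

    Matrix M(n, std::vector<unsigned>(n, 0));
    for (std::uint64_t i = 1; i + 2 <= p; ++i)  M[ F[p-i] ][ F[i+1] ] += 1;  // i = 1..p-2

    if ( t % 2 == 1 )  // odd t implies even n
    {
        const unsigned h = n / 2;
        for (unsigned i = 0; i < h; ++i)  { M[i][h+i] += 1;  M[h+i][i] += 1; }
    }
    return M;
}

// Number of entries that are nonzero after reduction modulo 2.
unsigned nonzero_mod2(const Matrix &M)
{
    unsigned ct = 0;
    for (const auto &row : M)  for (unsigned e : row)  ct += (e & 1);
    return ct;
}
```

For the optimal normal bases with n = 4, t = 1 and n = 3, t = 2 the sketch gives 2n - 1 nonzero entries (7 and 5, respectively).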


The file [FXT: ...] lists for each 2 <= n <= 1032 the smallest ten values of t
such that there is a type-t GNB of GF(2^n). Note that different values of t do not necessarily lead to
different multiplication matrices, especially for small values of n. For example, the modulo 2 reduced
multiplication matrices for n = 6 and the 10 smallest values of t are:

[Ten 6x6 matrices, not reproduced, for t = 2, 3, 6, 10, 11, 23, 27, 30, 35, 55. Six of them are marked
as duplicates of earlier ones by the annotations ==t=6, ==t=3, ==t=23, ==t=2, ==t=23, ==t=23.]


42.9.2 Determination of the field polynomial 


We give algorithms to compute the field polynomial corresponding to a given pair (n, t) such that a
type-t GNB exists over GF(2^n). A list of the polynomials for n <= 63 and 1 <= t <= 11 is given in
[FXT: data/gauss-normal-polys.txt].


42.9.2.1 Algorithm with complex numbers 


n=4 t=1: p=5
a(1)=2 w(1)=(-0.809016994374947 + 0.587785252292473 I)
a(2)=4 w(2)=(+0.309016994374947 - 0.951056516295154 I)
a(3)=3 w(3)=(-0.809016994374947 - 0.587785252292473 I)
a(4)=1 w(4)=(+0.309016994374947 + 0.951056516295154 I)
z(x)=x^4 + x^3 + x^2 + x + 1
p(x)=x^4 + x^3 + x^2 + x + 1

n=4 t=3: p=13
a(1)=2 w(1)=(-1.15138781886600 + 1.72542218842201 I)
a(2)=4 w(2)=(+0.651387818865997 - 0.522415803456408 I)
a(3)=8 w(3)=(-1.15138781886600 - 1.72542218842201 I)
a(4)=3 w(4)=(+0.651387818865997 + 0.522415803456408 I)
z(x)=x^4 + x^3 + 2*x^2 - 4*x + 3
p(x)=x^4 + x^3 + 1

n=4 t=7: p=29
a(1)=2 w(1)=(-1.59629120178363 - 0.509187583844044 I)
a(2)=4 w(2)=(+1.09629120178363 + 2.64399848798351 I)
a(3)=8 w(3)=(-1.59629120178363 + 0.509187583844044 I)
a(4)=16 w(4)=(+1.09629120178363 - 2.64399848798351 I)
z(x)=x^4 + x^3 + 4*x^2 + 20*x + 23
p(x)=x^4 + x^3 + 1


Figure 42.9-C: Numerical values with the computation of the field polynomial for n = 4 and types
t in {1, 3, 7}. Note that the final result is identical for the types t = 3 and t = 7.


The normal polynomial corresponding to a type-t Gaussian basis can be computed as follows: 


n=11 t=2: p=23
a(1)=2   w(1)=(+1.7088388090929771051 - 1.175494350822287508 E-38 I)
a(2)=4   w(2)=(+0.9201300754623042520 + 5.877471754111437540 E-39 I)
a(3)=8   w(3)=(-1.1533606442297342825 - 5.877471754111437540 E-39 I)
a(4)=16  w(4)=(-0.6697592243419723039 - 2.938735877055718770 E-39 I)
a(5)=9   w(5)=(-1.5514225814088396141 + 1.763241526233431262 E-38 I)
a(6)=18  w(6)=(+0.4069120261052675797 + 1.193861450053885750 E-39 I)
a(7)=13  w(7)=(-1.8344226030109060357 + 1.028557556969501569 E-38 I)
a(8)=3   w(8)=(+1.3651062864373081657 + 1.175494350822287508 E-38 I)
a(9)=6   w(9)=(-0.1364848267293419518 - 1.205340887073634651 E-39 I)
a(10)=12 w(10)=(-1.9813718920726615047 + 2.277520304718182046 E-38 I)
a(11)=1  w(11)=(+1.9258345746955985900 - 2.938735877055718770 E-38 I)
z(x)=x^11 + x^10 - 10*x^9 - 9*x^8 + 36*x^7 + 28*x^6 - 56*x^5 \
     - 35*x^4 + 35*x^3 + 15*x^2 - 6*x - 1
p(x)=x^11 + x^10 + x^8 + x^4 + x^3 + x^2 + 1

n=11 t=6: p=67
[--snip--]
z(x)=x^11 + x^10 - 30*x^9 - 63*x^8 + 220*x^7 + 698*x^6 - 101*x^5 \
     - 1960*x^4 - 1758*x^3 - 35*x^2 + 243*x - 29
p(x)=x^11 + x^10 + x^8 + x^5 + x^2 + x + 1

n=11 t=8: p=89
[--snip--]
z(x)=x^11 + x^10 - 40*x^9 - 19*x^8 + 482*x^7 + 84*x^6 - 2185*x^5 \
     + 102*x^4 + 3152*x^3 - 781*x^2 + 57*x - 1
p(x)=x^11 + x^10 + x^8 + x^5 + x^2 + x + 1

n=11 t=18: p=199
[--snip--]
z(x)=x^11 + x^10 - 90*x^9 - 115*x^8 + 2349*x^7 + 943*x^6 - 26327*x^5 \
     + 21284*x^4 + 102168*x^3 - 217794*x^2 + 148930*x - 30647
p(x)=x^11 + x^10 + x^8 + x^7 + x^6 + x^5 + 1


Figure 42.9-D: Numerical values with the computation of the field polynomial for n = 11 and types
t in {2, 6, 8, 18}. The final results for t = 6 and t = 8 are identical.


1. Set p = tn + 1 and determine r such that the order of r modulo p equals t.
2. For 1 <= k <= n compute w_k = \sum_{j=0}^{t-1} exp(a_{k,j} 2 \pi i / p) where a_{k,j} = 2^k r^j mod p.
3. Let z(x) = \prod_{k=1}^{n} (x - w_k); this is a polynomial with real integer coefficients.
4. Return the polynomial with coefficients reduced modulo 2.
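The four steps can be sketched in C++ with `std::complex<double>`. This is an illustration under several assumptions: a type-t GNB exists for the given n, double precision is sufficient (hence the rough guard n <= 20, and t should be small as well), and an element of order t is found by brute force. The names `order_mod`, `element_of_order` and `gnb_field_poly_complex` are hypothetical; the result is returned as a bit mask (bit i is the coefficient of x^i).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// Multiplicative order of a modulo the prime p; a must not be divisible by p.
std::uint64_t order_mod(std::uint64_t a, std::uint64_t p)
{
    assert( a % p != 0 );
    std::uint64_t x = a % p, k = 1;
    while ( x != 1 )  { x = (x * a) % p;  ++k; }
    return k;
}

// Return an element of order t modulo the prime p (zero if there is none).
std::uint64_t element_of_order(std::uint64_t t, std::uint64_t p)
{
    for (std::uint64_t g = 1; g < p; ++g)
        if ( order_mod(g, p) == t )  return g;
    return 0;
}

// Field polynomial for the type-t GNB of GF(2^n), computed with inexact arithmetic.
std::uint64_t gnb_field_poly_complex(unsigned n, unsigned t)
{
    assert( n >= 1 && n <= 20 );  // rough guard for double precision
    const std::uint64_t p = std::uint64_t(t) * n + 1;
    const std::uint64_t r = element_of_order(t, p);
    assert( r != 0 );
    const double tau = 2.0 * std::acos(-1.0) / double(p);  // 2*Pi/p

    std::vector<cplx> z(1, cplx(1.0, 0.0));  // z[i] is the coefficient of x^i
    std::uint64_t tk = 1;
    for (unsigned k = 1; k <= n; ++k)
    {
        tk = (2 * tk) % p;  // 2^k mod p
        cplx wk(0.0, 0.0);
        std::uint64_t a = tk;
        for (unsigned j = 0; j < t; ++j)
        {
            wk += std::polar(1.0, tau * double(a));
            a = (a * r) % p;
        }
        std::vector<cplx> y(z.size() + 1, cplx(0.0, 0.0));  // y = (x - wk) * z
        for (std::size_t i = 0; i < z.size(); ++i)
        {
            y[i+1] += z[i];
            y[i] -= wk * z[i];
        }
        z = y;
    }

    std::uint64_t c = 0;
    for (std::size_t i = 0; i < z.size(); ++i)
    {
        const long long v = std::llround( z[i].real() );  // should be close to an integer
        if ( v % 2 != 0 )  c |= std::uint64_t(1) << i;
    }
    return c;
}
```

A production version would also check that the imaginary parts and the rounding errors of the coefficients are small, as suggested in the text.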


The computation of the polynomial z(x) uses complex (inexact) arithmetic. The coefficients should be
close to real integers, which can be used as a check. The following GP routine computes the complex
polynomial:

gauss_zpoly(n, t)=
{ /* return field polynomial for type-t Gaussian normal basis
     as polynomial over the complex numbers */
    local(p, r, wk, tk1, tk, a, zp);
    p = n*t + 1;
    r = znprimroot(p)^n;  \\ r has order t (mod p)
    zp = 1;
    tk1 = Mod(2,p);  tk = Mod(1,p);
    for (k=1, n,
        tk *= tk1;  \\ == Mod(2,p)^k;
        wk = 0;
        a = tk;
        for (j=0, t-1,
            wk += exp(2.0*I*Pi*lift(a)/p);
            a *= r;
        );
        zp *= (x-wk);
    );
    return ( zp );
}


The final step uses GP's function round() which rounds all coefficients of its polynomial argument:

gauss_poly(n, t)=
{ /* return field polynomial for type-t Gaussian normal basis */
    local(pp, zp);
    zp = gauss_zpoly(n, t);
    pp = round(real(zp));  /* rounds all coefficients */
    pp *= Mod(1,2);  /* coefficients modulo 2 */
    return( pp );
}

The results for type-1 bases can be verified using relation 42.8-1 on page 912, results with type-2 bases
with relations 42.8-2a and 42.8-2b. The values occurring with the computation for n = 4 and the types
t in {1, 3, 7} are shown in figure 42.9-C, the values for n = 11 and t in {2, 6, 8, 18} in figure 42.9-D.

The computation can be optimized by using a trigonometric recursion as described in section 21.3.2.
We further exploit symmetry and use real values if the type t is even:


vexp(p, t)=
{
    local( ve, ph, c, s, al, be, cp, sp, tt );
    tt = 2.0*Pi/p;  \\ angle increment
    c=1.0;  s=0.0;  al = 2.0*(sin(0.5*tt))^2;  be = sin(tt);
    ve = vector(p);  ve[1] = 1.0;
    if ( t%2,  /* odd t, need complex values */
        for (j=1, (p-1)>>1,
            tt = c;
            c -= (al*tt+be*s);
            s -= (al*s -be*tt);
            ve[j+1] = c + I*s;
            ve[p-j+1] = c - I*s;
        );
    ,  /* even t: can use real values */
        for (j=1, (p-1)>>1,
            tt = c;
            c -= (al*tt+be*s);
            s -= (al*s -be*tt);
            ve[j+1] = c + s;
            ve[p-j+1] = c - s;
        );
    );
    return( ve );
}

The computation of the field polynomial needs two changes for even t:

gauss_vpoly(n, t)=
    [--snip--]
    ve = vexp(p, t);  \\ precompute trigonometric values
    [--snip--]
            wk += ve[lift(a)+1];  \\ was: wk += exp(2.0*I*Pi*lift(a)/p);
    [--snip--]

For odd t we only need n/2 of the loop iterations:

    for (k=1, n/2,  \\ note: n/2 times
        tk *= tk1;  \\ == Mod(2,p)^k;
        wk = 0;
        a = tk;
        for (j=0, t-1,
            wk += ve[lift(a)+1];
            a *= r;
        );
        \\ use (x-(a+I*b))*(x-(a-I*b)) == x^2 - 2*a*x + (a^2 + b^2):
        zp *= (x^2-2*real(wk)*x+norm(wk));
    );

Note that the polynomial is always real throughout the computation.


42.9.2.2 Algorithm working in GF(2) 


The following algorithm is a variation of what is given in [340]. 
n=4 t=3: p=13  \\ integer computation
r = Mod(3, 13), ord(r) = 3 == t
M = x^12 + x^11 + x^10 + x^9 + x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x + 1
k=1
  Z = x^9 + x^3 + x
  F = x^9 + x^3 + 2*x
k=2
  Z = x^6 + x^5 + x^2
  F = x^11 + x^10 + x^9 + x^8 + 2*x^7 + 2*x^6 + x^5 + x^4 + 2*x^3 + 3*x^2 + x
k=3
  Z = -x^11 - x^9 - x^8 - x^7 - x^6 - x^5 - x^3 - x^2 - x - 1
  F = -x^11 - x^10 - 2*x^9 + x^7 + 2*x^6 + x^3 - 1
k=4
  Z = x^11 + x^8 + x^7
  F = x^4 - x^3 + 2*x^2 + 4*x + 3
==> x^4 - x^3 + 2*x^2 + 4*x + 3 == x^4 + x^3 + 1 (mod 2)


n=4 t=3: p=13  \\ computation over GF(2)
r = Mod(3, 13), ord(r) = 3 == t
M = x^12 + x^11 + x^10 + x^9 + x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x + 1
k=1
  Z = x^9 + x^3 + x
  F = x^9 + x^3 + 2*x
k=2
  Z = x^6 + x^5 + x^2
  F = x^11 + x^10 + x^9 + x^8 + x^5 + x^4 + x^2 + x
k=3
  Z = x^11 + x^9 + x^8 + x^7 + x^6 + x^5 + x^3 + x^2 + x + 1
  F = x^11 + x^10 + x^7 + x^3 + 1
k=4
  Z = x^11 + x^8 + x^7
  F = x^4 + x^3 + 1
==> x^4 + x^3 + 1


Figure 42.9-E: Computation of the field polynomial for n = 4 and t = 3 with polynomials over the
integers (top) and polynomials over GF(2) (bottom).


1. Set p = tn + 1 and determine r such that the order of r modulo p equals t.
2. Set M = \sum_{k=0}^{p-1} x^k. All computations are done modulo M.
3. If t equals 1, then return M.
4. Set F_0 = 1 (modulo M).
5. For 1 <= k <= n:
   (a) Set Z_k = \sum_{j=0}^{t-1} x^{a(k,j)} (modulo M) where a(k, j) = 2^k r^j mod p.
   (b) Set F_k = (x + Z_k) F_{k-1} (modulo M).
6. Return F_n.


The intermediate quantities in the computation for n = 4 and t = 3 are shown at the top of figure 42.9-E.
The result is a polynomial over the integers identical to the one computed with the algorithm that uses
complex numbers. When all polynomials are taken over GF(2) the computation proceeds as shown at
the bottom of figure 42.9-E. Implementation in GP:


gauss_poly2(n, t)=
{ /* return field polynomial for type-t Gaussian normal basis */
    local(p, M, r, F, t21, t2, Z);
    p = t*n + 1;
    r = znprimroot(p)^n;  \\ element of order t mod p
    M = sum(k=0, p-1, 'x^k);  \\ The polynomial modulus
    M *= Mod(1,2);  \\ ... over GF(2)
    if ( 1==t, return( M ) );  \\ for type 1

    F = Mod(1, M);
    t21 = Mod(2,p);  t2 = Mod(1,p);
    for (k=1, n,
        Z = sum(j=0, t-1, Mod('x^lift(t2*r^j), M) );
        F = ('x+Z)*F;
        t2 *= t21;
    );
    return ( lift(F) );
}


While the algorithm avoids inexact arithmetic, the polynomial modulus M is of degree p - 1 = nt which
is large for large t. The computation with complex numbers is much faster in practice. It finishes in less
than a second for n = 620 and t = 3 (and a working precision of 150 decimal digits), the exact method
needs about two minutes.


Using the redundant modulus x^p - 1 = (x - 1) · M gives a significant speedup:


gauss_poly2(n, t)=
{
    local(p, M, r, F, t21, t2, Z);
    p = t*n + 1;
    r = znprimroot(p)^n;  \\ element of order t mod p
    M = 'x^p - 1;  \\ Use redundant modulus (instead of sum(k=0, p-1, 'x^k))
    M *= Mod(1,2);  \\ ... over GF(2)

    if ( 1==t, return( sum(k=0, p-1, 'x^k) ) );  \\ for type 1

    [--snip--]  \\ main loop as before

    \\ final reduction for redundant modulus:
    M = sum(k=0, p-1, 'x^k);  \\ The polynomial modulus
    F = lift( Mod( lift(F), M) );
    return ( F );
}


Now computation of the polynomial for n = 620 and t = 3 takes less than nine seconds. The final
reduction can be simplified by observing that no reduction is needed if the constant coefficient is one,
else all coefficients just have to be negated. So the end of the routine can be changed to


    \\ final reduction for redundant modulus (simplified):
    F = lift(F);
    if ( 0==polcoeff(F,0), F=sum(k=0, n, (1-polcoeff(F,k))*'x^k) );
    return ( F );
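The exact method with the redundant modulus x^p - 1 can be sketched in C++ for small parameters: a polynomial modulo x^p - 1 over GF(2) is a p-bit word, multiplication by x is a rotation, and the simplified final reduction is a complement of the low n + 1 bits. The sketch assumes that a type-t GNB exists for n and that p = tn + 1 <= 63; the names `order_mod`, `element_of_order`, `cyclic_mul` and `gnb_field_poly_gf2` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Multiplicative order of a modulo the prime p; a must not be divisible by p.
std::uint64_t order_mod(std::uint64_t a, std::uint64_t p)
{
    assert( a % p != 0 );
    std::uint64_t x = a % p, k = 1;
    while ( x != 1 )  { x = (x * a) % p;  ++k; }
    return k;
}

// Return an element of order t modulo the prime p (zero if there is none).
std::uint64_t element_of_order(std::uint64_t t, std::uint64_t p)
{
    for (std::uint64_t g = 1; g < p; ++g)
        if ( order_mod(g, p) == t )  return g;
    return 0;
}

// Multiply a*b modulo x^p - 1 over GF(2); a and b are p-bit words, p <= 63.
std::uint64_t cyclic_mul(std::uint64_t a, std::uint64_t b, unsigned p)
{
    const std::uint64_t mask = (std::uint64_t(1) << p) - 1;
    std::uint64_t r = 0;
    for (unsigned i = 0; i < p; ++i)
    {
        if ( (b >> i) & 1 )  r ^= a;
        a = ((a << 1) | (a >> (p-1))) & mask;  // rotate: multiply by x
    }
    return r;
}

// Field polynomial for the type-t GNB of GF(2^n) as bit mask (bit i: coefficient of x^i).
std::uint64_t gnb_field_poly_gf2(unsigned n, unsigned t)
{
    const std::uint64_t p = std::uint64_t(t) * n + 1;
    assert( p >= 3 && p <= 63 );
    if ( t == 1 )  return (std::uint64_t(1) << p) - 1;  // all-ones polynomial of degree n
    const std::uint64_t r = element_of_order(t, p);
    assert( r != 0 );

    std::uint64_t F = 1, t2 = 1;
    for (unsigned k = 0; k < n; ++k)
    {
        std::uint64_t Z = 0, a = t2;
        for (unsigned j = 0; j < t; ++j)  { Z ^= std::uint64_t(1) << a;  a = (a * r) % p; }
        F = cyclic_mul(F, Z ^ 2, unsigned(p));  // F = (x + Z) * F
        t2 = (2 * t2) % p;
    }

    if ( (F & 1) == 0 )  F = ~F;  // constant coefficient zero: complement all coefficients
    return F & ((std::uint64_t(1) << (n+1)) - 1);  // keep coefficients 0..n
}
```

With n = 4 and t = 3 the intermediate words agree (up to the redundant modulus) with the bottom half of figure 42.9-E, and the result is x^4 + x^3 + 1.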


“Ever tried. Ever failed. No matter. 
Try Again. Fail again. Fail better.” 


— Samuel Beckett 




Appendix A 


The electronic version of the book 


The electronic version of this book is available free of charge at http://www.jjj.de/fxt/#fxtbook, it
is identical to the printed version.


Copyright and license

Copyright © Jörg Arndt.


The electronic version is distributed under the terms and conditions of the Creative Commons license
“Attribution-Noncommercial-No Derivative Works 3.0”. You are free to copy, distribute and transmit
this book under the following conditions:


  • Attribution. You must attribute the work in the manner specified by the author or licensor (but
    not in any way that suggests that they endorse you or your use of the work).

  • Noncommercial. You may not use this work for commercial purposes.

  • No Derivative Works. You may not alter, transform, or build upon this work.


For any reuse or distribution, you must make clear to others the license terms of this work. The best
way to do this is with a link to the web page below. Any of the above conditions can be waived if you
get permission from the copyright holder. Nothing in this license impairs or restricts the author’s moral
rights.


For more information about the license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/


How to make the hyperlinks work on your computer 


The hyperlink showing as [FXT: bits/revbin.h] points to file:.fxtdir/src/bits/revbin.h. To make
this work on your machine you may want to create (in the directory where the viewer is started) a soft-link
to the directory of the FXT sources. For example, assuming that the package is located at ~/work/fxt,
execute the following statement:


ln -sv ~/work/fxt ~/.fxtdir

Similarly, for hfloat, do

ln -sv ~/work/hfloat ~/.hfloatdir


Test with the hyperlink [hfloat: src/hf/funcsrt.cc] which points to
file:.hfloatdir/src/hf/funcsrt.cc


For xdvi you may want to add the following lines to your ~/.mailcap file:


text/plain;/usr/bin/emacs -no-site-file %s &
text/x-csrc;/usr/bin/emacs -no-site-file %s &
text/x-chdr;/usr/bin/emacs -no-site-file %s &
text/x-c++src;/usr/bin/emacs -no-site-file %s &
text/x-c++hdr;/usr/bin/emacs -no-site-file %s &


Here the editor emacs is used for viewing plain text, C and C++ sources and headers.


Mozilla based browsers do not handle local links correctly, so you may want to use an alternative browser
for this book, especially with the pdf files.
Machine used for benchmarking


The machine used for performance measurements is an AMD64 (Athlon64) clocked at 2.2 GHz with dual 
channel double data rate (DDR) clocked at 200 MHz (‘800 MHz’). It has 512 kB (16-way associative) 
second level cache and separate first level caches for data and instructions, each 64 kB (and 2-way 
associative). Cache lines are 64 bytes (8 words, 512 bits). The memory controller is integrated in the 
CPU. 


The CPU has 16 general purpose (64 bit) registers that are addressable as byte, 16 bit word, 32 bit word, 
or 64 bit (full) word. These are used for integer operations and for passing integer function arguments. 
There are 16 (128 bit, SSE) registers that are used for floating-point operations and for passing floating- 
point function arguments. The SSE registers are SIMD registers. Additionally, there are 8 (legacy, x87) 
FPU registers. 


The parts of the information reported by the CPUID instruction that are relevant for performance are: 


Vendor: AuthenticAMD
Name: AMD Athlon(tm) 64 Processor 3500+
Family: 15, Model: 47, Stepping: 2
Level 1 cache (data): 64 kB, 2-way associative.
    64 bytes per line, lines per tag: 1.
Level 1 cache (instr): 64 kB, 2-way associative.
    64 bytes per line, lines per tag: 1.
Level 2 cache: 512 kB, 16-way associative
    64 bytes per line, lines per tag: 1.
Max virtual addr width: 48
Max physical addr width: 40
Features:
    lm: Long Mode (64-bit mode)
    nx: No-Execute Page Protection
    mtrr: Memory Type Range Registers
    tsc: Time Stamp Counter
    fpu: x87 FPU
    3dnow: AMD 3DNow! instructions
    3dnowext: AMD Extensions to 3DNow!
    mmx: Multimedia Extensions
    mmxext: AMD Extensions to MMX
    sse: Streaming SIMD Extensions
    sse2: Streaming SIMD Extensions-2
    sse3: Streaming SIMD Extensions-3
    cmov: CMOV instruction (plus FPU FCMOVCC and FCOMI)
    cx8: CMPXCHG8 instruction
    fxsr: FXSAVE and FXRSTOR instructions
    ffxsr: fast FXSAVE and FXRSTOR instructions
    lmlahf: load/store flags to ah (LAHF/SAHF) in 64-bit mode


Special instructions such as SIMD, prefetch and non-temporal moves are not used unless explicitly noted.


See [169] for a comparison of instruction latencies and throughput for various x86 CPU cores. You do 
want to study the cited document before buying an x86-based system. 


The compiler used was the GNU (C and C++) compiler [146]. 
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The GP language 


We give a short introduction to GP, the language of the pari calculator [266]. From the manual page 
(slightly edited): 


NAME gp - PARI calculator
SYNOPSIS gp [-emacs] [-f] [-test] [-q] [-s stacksize] [-p primelimit]
DESCRIPTION
Invokes the PARI-GP calculator. This is an advanced programmable
calculator, which computes symbolically as long as possible, numerically
where needed, and contains a wealth of number-theoretic functions
(elliptic curves, class field theory...). Its basic data types are
integers, real numbers, exact rational numbers, algebraic numbers,
p-adic numbers, complex numbers, modular integers, polynomials and
rational functions, power series, binary quadratic forms, matrices,
vectors, lists, character strings, and recursive combinations of these.


Interactive usage 


To use GP interactively, just type gp at your command line prompt. A startup message like the following 
will appear: 


GP/PARI CALCULATOR Version 2.3.4 (released) 
amd64 running linux (x86-64 kernel) 64-bit version 
compiled: Oct 14 2008, gcc-4.2.1 (SUSE Linux) 
(readline v5.2 enabled, extended help available) 


Copyright (C) 2000-2006 The PARI Group 


PARI/GP is free software, covered by the GNU General Public License, and 
comes WITHOUT ANY WARRANTY WHATSOEVER. 


Type ? for help, \q to quit. 
Type ?12 for how to get moral (and possibly technical) support. 


parisize = 8000000, primelimit = 500000 
? 


The question mark in the last line is a prompt, the program is waiting for your input. 


? 1+1
%1 = 2


Here we successfully computed one plus one. Next we compute a factorial: 


? 44! 
%2 = 2658271574788448768043625811014615890319638528000000000


Integers are of unlimited precision, the practical limit is the amount of physical RAM. For floating-point 
numbers, the precision (number of decimal digits) can be set as follows 


? default(realprecision,55)
%3 = 55
? sin(1.5)
%4 = 0.9974949866040544309417233711414873227066514259221158219


The history numbers %N (where N is a number) can be used to recall the result of a prior computation: 
? %4
%5 = 0.9974949866040544309417233711414873227066514259221158219


The output of the result of a calculation can be suppressed using a semicolon at the end of the command. 
This is useful for timing purposes:


? default(realprecision,10000)
%5 = 10000
? sin(2.5);
? ##
  ***   last result computed in 100 ms.


The command ## gives the time used for the last computation. 
The printing format can be set independently of the precision used: 


? default(realprecision,10000);
? default(format,"g.15");
? sin(2.5)
%6 = 0.598472144103956


Command line completion is available, typing si, then the tab-key, gives a list of built-in functions whose 
names start with si: 


? si 
sigma sign simplify sin sinh sizebyte sizedigit


You can get the help text by using the question mark, followed by the help topic: 


? ?sinh 
sinh(x): hyperbolic sine of x. 


A help overview is invoked by a single question mark:


? ?
Help topics: for a list of relevant subtopics, type ?n for n in
  0: user-defined identifiers (variable, alias, function)
  1: Standard monadic or dyadic OPERATORS
  2: CONVERSIONS and similar elementary functions
  3: TRANSCENDENTAL functions
  4: NUMBER THEORETICAL functions
  5: Functions related to ELLIPTIC CURVES
  6: Functions related to general NUMBER FIELDS
  7: POLYNOMIALS and power series
  8: Vectors, matrices, LINEAR ALGEBRA and sets
  9: SUMS, products, integrals and similar functions
 10: GRAPHIC functions
 11: PROGRAMMING under GP
 12: The PARI community


Select a section by its number: 


? ?7
O deriv eval factorpadic
intformal padicappr polcoeff polcyclo
poldegree poldisc poldiscreduced polhensellift
polinterpolate polisirreducible pollead pollegendre
polrecip polresultant polroots polrootsmod
polrootspadic polsturm polsubcyclo polsylvestermatrix
polsym poltchebi polzagier serconvol
serlaplace serreverse subst substpol
substvec taylor thue thueinit


You should try both of the following:

? ??tutorial
displaying ’tutorial.dvi’.

? ??
displaying ’users.dvi’.

A short overview (which you may want to print) of most functions can be obtained via 


? ??refcard
displaying ’refcard.dvi’.


A session can be ended by either entering quit or just hitting control-d. 
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Built-in operators and basic functions 


There are the ‘usual’ operators +, -, *, /, ^ (powering), and % (modulo). The operator \ gives the integer
quotient without remainder. The assignment operator is =. C-style shortcuts are available, for example
t+=3 is the same as t=t+3.


The increment by 1 can be abbreviated as t++, the decrement as t--. [Technical note: these behave as 
the C-language pre-increment (and pre-decrement), that is, the expression evaluates to t+1, not t. There 
is no post-increment or post-decrement in GP.] 


Comparison operators are ==, != (alternatively <>), >, >=, <, and <=. Logical operators are && (and),
|| (or), and ! (not).
Bit-wise operations for integers are 

bitand bitneg bitnegimply bitor bittest  bitxor 


and 


shift(x,n): shift x left n bits if n>=0, right -n bits if n<0. 
shiftmul(x,n): multiply x by 2^n (n>=0 or n<0)


One can also use the operators >> and <<, as in the C-language, and the shortcuts >>= and <<=. 
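
For comparison with C++, the behavior of shift(x,n) for either sign of n can be sketched as follows. This is only an analogue for 64-bit unsigned words, and the name shift_word is made up for this illustration. GP integers have unlimited precision, so in GP no bits are lost on a left shift.

```cpp
#include <cstdint>

// Hypothetical analogue of GP's shift(x,n) for 64-bit unsigned words:
// shift left by n bits if n >= 0, right by -n bits if n < 0.
// Bits shifted beyond bit 63 are lost (unlike in GP).
inline std::uint64_t shift_word(std::uint64_t x, int n)
{
    // Shift counts of 64 or more are undefined in C++, handle them explicitly:
    if ( n >= 64 || n <= -64 )  return 0;
    return ( n >= 0 ) ? ( x << n ) : ( x >> -n );
}
```

For example, shift_word(5, 3) gives 40 and shift_word(40, -3) gives 5 again.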


An overview of basic functions is obtained as 


? ?2
Col List Mat Mod Pol Polrev Qfb
Ser Set Str Strchr Strexpand Strtex Vec
Vecsmall binary bitand bitneg bitnegimply bitor bittest
bitxor ceil centerlift changevar component conj conjvec
denominator floor frac imag length lift norm
norml2 numerator numtoperm padicprec permtonum precision random
real round simplify sizebyte sizedigit truncate valuation
variable


Here are a few: 


sign(x): sign of x, of type integer, real or fraction. 
max(x,y): maximum of x and y. 

min(x,y): minimum of x and y. 

abs(x): absolute value (or modulus) of x. 


floor(x): floor of x = largest integer<=x. 
ceil(x): ceiling of x=smallest integer>=x. 
frac(x): fractional part of x = x-floor(x) 
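
The definition frac(x) = x-floor(x) implies that the fractional part is never negative. A minimal C++ sketch of the same definition for double arguments (GP also accepts exact rationals, which are not modeled here):

```cpp
#include <cmath>

// frac(x) = x - floor(x); the result lies in [0, 1) also for negative x.
inline double frac(double x)
{
    return x - std::floor(x);
}
```

With this definition frac(-1.25) is 0.75, because floor(-1.25) is -2. Subtracting the truncated value instead would give -0.25.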


An overview of sums, products, and some numerical functions: 


? ?9
intcirc intfouriercos intfourierexp intfouriersin
intfuncinit intlaplaceinv intmellininv intmellininvshort
intnum intnuminit intnuminitgen intnumromb
intnumstep prod prodeuler prodinf
solve sum sumalt sumdiv
suminf sumnum sumnumalt sumnuminit
sumpos


For example: 


sum(X=a,b,expr,{x=0}): x plus the sum (X goes from a to b) of expression expr. 
prod(X=a,b,expr,{x=1}): x times the product (X runs from a to b) of expression. 
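
The role of the optional start value x in sum() and prod() may be easier to see in a C++ analogue. The sketch below uses made-up names (sum_range, prod_range) and 64-bit integers, whereas GP works with arbitrary types and unlimited precision.

```cpp
// Hypothetical analogue of GP's sum(X=a,b,expr,{x=0}):
// returns x plus the sum of expr(X) for X = a, a+1, ..., b.
template <typename Func>
long long sum_range(long a, long b, Func expr, long long x = 0)
{
    for (long X = a; X <= b; ++X)  x += expr(X);
    return x;
}

// Hypothetical analogue of GP's prod(X=a,b,expr,{x=1}):
// returns x times the product of expr(X) for X = a, a+1, ..., b.
template <typename Func>
long long prod_range(long a, long b, Func expr, long long x = 1)
{
    for (long X = a; X <= b; ++X)  x *= expr(X);
    return x;
}
```

An empty range (b less than a) returns the start value unchanged, which is why the defaults are 0 for the sum and 1 for the product.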


Basic data types 


Strings:
? a="good day!"
"good day!"
Integers, floating-point numbers (real or complex), and complex integers:
? factor(239+5*I)
[-I 1]
[1 + I 1]
[117 + 122*I 1]




Exact rationals:
? 2/3+4/5
22/15
Modular integers:
? Mod(3,239)^77
Mod(128, 239)
Vectors and matrices:
? v=vector(5,j,j^2)
[1, 4, 9, 16, 25]
? m=matrix(5,5,r,c,r+c)
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
[5 6 7 8 9]
[6 7 8 9 10]
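
The power of the modular integer Mod(3,239) shown above can be cross-checked with a few lines of C++. The sketch uses right-to-left binary exponentiation; the name pow_mod is chosen for this illustration, and the modulus is assumed to be less than 2^32 so that all products fit into 64 bits.

```cpp
#include <cstdint>

// Compute a^e modulo m by right-to-left binary exponentiation.
// Assumes m < 2^32, so that (r * a) and (a * a) cannot overflow.
inline std::uint64_t pow_mod(std::uint64_t a, std::uint64_t e, std::uint64_t m)
{
    std::uint64_t r = 1 % m;  // equals zero if m == 1
    a %= m;
    while ( e != 0 )
    {
        if ( e & 1 )  r = (r * a) % m;  // multiply in the current power of a
        a = (a * a) % m;                // square for the next bit of e
        e >>= 1;
    }
    return r;
}
```

Here pow_mod(3, 77, 239) returns 128, in agreement with the GP result.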


The vector is a row vector, trying to right-multiply it with the matrix fails: 


? t=m*v
  ***   impossible multiplication t_MAT * t_VEC.


The operator ~ transposes vectors (and matrices), we multiply with the column vector: 
? t=m*v~
%14 = [280, 335, 390, 445, 500]~

The result is a column vector, note the tilde at the end of the line. 
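
The product of the matrix with the column vector can be recomputed in C++ as a check. This is a plain sketch with std::vector; the type and function names are made up for this illustration, and indices are zero-based in C++ while GP counts from one.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<long> Vec;
typedef std::vector<Vec>  Mat;

// Product m * v of a matrix (stored as vector of rows) with a column vector.
inline Vec mat_times_col(const Mat &m, const Vec &v)
{
    Vec t(m.size(), 0);
    for (std::size_t r = 0; r < m.size(); ++r)
        for (std::size_t c = 0; c < v.size(); ++c)
            t[r] += m[r][c] * v[c];
    return t;
}

// Rebuild the GP example: v = vector(5,j,j^2), m = matrix(5,5,r,c,r+c),
// where r, c, and j are one-based as in GP.
inline Vec example_product()
{
    const long n = 5;
    Vec v(n);
    Mat m(n, Vec(n));
    for (long j = 1; j <= n; ++j)  v[j-1] = j * j;
    for (long r = 1; r <= n; ++r)
        for (long c = 1; c <= n; ++c)
            m[r-1][c-1] = r + c;
    return mat_times_col(m, v);
}
```

The call example_product() gives the entries 280, 335, 390, 445, 500 as in the GP session.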


Vector indices start with one: 


? t[1] 
%15 = 280


Symbolic computations 


Univariate polynomials: 


? (1+x)^7
x^7 + 7*x^6 + 21*x^5 + 35*x^4 + 35*x^3 + 21*x^2 + 7*x + 1
? factor((1+x)^6+1)
[x^2 + 2*x + 2 1]
[x^4 + 4*x^3 + 5*x^2 + 2*x + 1 1]
Power series:


? (1+x+O(x^4))^7
1 + 7*x + 21*x^2 + 35*x^3 + O(x^4)
? log((1+x+O(x^4))^7)
7*x - 7/2*x^2 + 7/3*x^3 + O(x^4)
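
The expansion of (1+x)^7 and the factorization of (1+x)^6+1 can be verified by multiplying out. The following is a minimal C++ sketch with polynomials stored as coefficient vectors; the names Poly, poly_mul and poly_pow are assumptions of this illustration. It uses the schoolbook product, not a fast method.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<long> Poly;  // coefficient of x^k is stored at index k

// Schoolbook product of two polynomials.
inline Poly poly_mul(const Poly &a, const Poly &b)
{
    if ( a.empty() || b.empty() )  return Poly();
    Poly c(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i+j] += a[i] * b[j];
    return c;
}

// e-th power by repeated multiplication (sufficient for small e).
inline Poly poly_pow(const Poly &a, unsigned e)
{
    Poly r(1, 1);  // the constant polynomial 1
    for (unsigned k = 0; k < e; ++k)  r = poly_mul(r, a);
    return r;
}
```

With a = 1+x stored as {1, 1}, poly_pow(a, 7) yields the coefficients 1, 7, 21, 35, 35, 21, 7, 1. The product of the two factors found by GP equals poly_pow(a, 6) with one added to the constant term.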


Types can be nested, here we compute modulo the polynomial 1 + x + x^7 with coefficients over GF(2):


? t=Mod(1+x, Mod(1,2)*(1+x+x^7))^77
Mod(Mod(1, 2)*x^3 + Mod(1, 2)*x + Mod(1, 2), Mod(1, 2)*x^7 + Mod(1, 2)*x + Mod(1, 2))
? lift(t) \\ discard modulo polynomial
Mod(1, 2)*x^3 + Mod(1, 2)*x + Mod(1, 2)
? lift(lift(t)) \\ discard modulo polynomial, then modulus 2 with coefficients
x^3 + x + 1
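
The same computation can be done with machine words when a binary polynomial is stored as a bit pattern (bit k is the coefficient of x^k). The sketch below is one way to write it and is not claimed to be the FXT implementation. The function names are made up, the modulus c must have degree deg, and the arguments must already be reduced, that is, of degree less than deg.

```cpp
#include <cstdint>

// Product a*b modulo c over GF(2); c has degree deg (bit deg of c is set),
// a and b have degree less than deg.  Addition of polynomials is XOR.
inline std::uint64_t gf2_mulmod(std::uint64_t a, std::uint64_t b,
                                std::uint64_t c, unsigned deg)
{
    const std::uint64_t h = std::uint64_t(1) << deg;
    std::uint64_t t = 0;
    while ( b != 0 )
    {
        if ( b & 1 )  t ^= a;  // add a * x^k for each set bit k of b
        b >>= 1;
        a <<= 1;               // multiply a by x ...
        if ( a & h )  a ^= c;  // ... and reduce modulo c
    }
    return t;
}

// Power a^e modulo c over GF(2) by binary exponentiation.
inline std::uint64_t gf2_powmod(std::uint64_t a, std::uint64_t e,
                                std::uint64_t c, unsigned deg)
{
    std::uint64_t r = 1;
    while ( e != 0 )
    {
        if ( e & 1 )  r = gf2_mulmod(r, a, c, deg);
        a = gf2_mulmod(a, a, c, deg);
        e >>= 1;
    }
    return r;
}
```

With 1+x as 0x3 and 1+x+x^7 as 0x83, the call gf2_powmod(0x3, 77, 0x83, 7) returns 0xB, the bit pattern of x^3+x+1.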


Symbolic computations are limited when compared to a computer algebra system: for example, multi- 
variate polynomials cannot (yet) be factored and there is no symbolic solver for polynomials. 
An uninitialized variable evaluates to itself, as a symbol: 


? hello 
hello 


To create a symbol, prepend a tick: 
? w=3
3
? hello='w /* the symbol w, not the value of w */
w


Here is a method to create symbols: 


? sym(k)=eval(Str("A", k)) 
? t=vector(5, j, sym(j-1)) 
[A0, A1, A2, A3, A4]


The ingredients are eval() and Str: 


eval(x): evaluation of x, replacing variables by their value. 
Str({str}*): concatenates its (string) argument into a single string. 


Some more trickery to think about: 


sym(k)=eval(Str("A", k))
t=vector(5, j, sym(j-1)); print("1: t=", t);
{ for (k=1, 5,
    sy = sym(k-1);
    v = 1/k^2;
    /* assign to the symbol that sy evaluates to, the value of v: */
    eval( Str( Str( sy ), "=", Str( v ) ) );
); }
print("2: t=", t); /* no lazy evaluation with GP */
t=eval(t); print("3: t=", t);


The output of this script is
1: t=[A0, A1, A2, A3, A4]
2: t=[A0, A1, A2, A3, A4]
3: t=[1, 1/4, 1/9, 1/16, 1/25]


More built-in functions 


The following constants and transcendental functions are available:


? ?3
Euler I Pi abs acos acosh agm arg
asin asinh atan atanh bernfrac bernreal bernvec besselh1
besselh2 besseli besselj besseljh besselk besseln cos cosh
cotan dilog eint1 erfc eta exp gamma gammah
hyperu incgam incgamc lngamma log polylog psi sin
sinh sqr sqrt sqrtn tan tanh teichmuller theta
thetanullk weber zeta


To obtain information about a particular function, use a question mark:


? ?sinh
sinh(x): hyperbolic sine of x.


Transcendental functions will also work with complex arguments and symbolically, returning a power
series:


? sinh(x)
%9 = x + 1/6*x^3 + 1/120*x^5 + 1/5040*x^7 + 1/362880*x^9 \
+ 1/39916800*x^11 + 1/6227020800*x^13 + 1/1307674368000*x^15 + O(x^17)


The line break (and the backslash indicating it) was manually entered for layout reasons. The ‘precision’
(that is, the default order) of power series can be set by the user:


? default(seriesprecision,9);
? sinh(x)
%11 = x + 1/6*x^3 + 1/120*x^5 + 1/5040*x^7 + 1/362880*x^9 + O(x^10)


One can also manually give the O(x^n) term:
? sinh(x+O(x^23))
%12 = x + 1/6*x^3 + 1/120*x^5 + 1/5040*x^7 + \
[--snip--]
+ 1/121645100408832000*x^19 + 1/51090942171709440000*x^21 + O(x^23)


Functions operating on matrices are (type mat, then hit the tab-key) 


matadjoint matalgtobasis matbasistoalg matcompanion
matdet matdetint matdiagonal mateigen
matfrobenius mathess mathilbert mathnf
mathnfmod mathnfmodid matid matimage
matimagecompl matindexrank matintersect matinverseimage
matisdiagonal matker matkerint matmuldiagonal
matmultodiagonal matpascal matrank matrix
matrixqz matsize matsnf matsolve
matsolvemod matsupplement mattranspose
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Built-in number theoretical functions are 


? ?4
addprimes bestappr bezout bezoutres bigomega binomial
chinese content contfrac contfracpnqn core coredisc
dirdiv direuler dirmul divisors eulerphi factor
factorback factorcantor factorff factorial factorint factormod
ffinit fibonacci gcd hilbert isfundamental ispower
isprime ispseudoprime issquare issquarefree kronecker lcm
moebius nextprime numbpart numdiv omega precprime
prime primepi primes qfbclassno qfbcompraw qfbhclassno
qfbnucomp qfbnupow qfbpowraw qfbprimeform qfbred qfbsolve
quadclassunit quaddisc quadgen quadhilbert quadpoly quadray
quadregulator quadunit removeprimes sigma sqrtint zncoppersmith
znlog znorder znprimroot znstar


Functions related to polynomials and power series are 


? ?7
O deriv eval factorpadic
intformal padicappr polcoeff polcyclo
poldegree poldisc poldiscreduced polhensellift
polinterpolate polisirreducible pollead pollegendre
polrecip polresultant polroots polrootsmod
polrootspadic polsturm polsubcyclo polsylvestermatrix
polsym poltchebi polzagier serconvol
serlaplace serreverse subst substpol
substvec taylor thue thueinit


Plenty to explore! 


Control structures for programming 


Some loop constructs available are 


while(a,seq): while a is nonzero evaluate the expression sequence seq. Otherwise 0.
until(a,seq): evaluate the expression sequence seq until a is nonzero.
for(X=a,b,seq): the sequence is evaluated, X going from a up to b.
forstep(X=a,b,s,seq): the sequence is evaluated, X going from a to b in steps of s
    (can be a vector of steps)
forprime(X=a,b,seq): the sequence is evaluated, X running over the primes between a
    and b.
fordiv(n,X,seq): the sequence is evaluated, X running over the divisors of n.
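
C++ has no direct counterparts of fordiv() and forprime(). The values the loop variable takes in these constructs can be produced with helper functions like the ones sketched below. The names are made up for this illustration and the methods (trial division) are chosen for brevity, not speed.

```cpp
#include <vector>

// Divisors of n in increasing order: candidate values for X in fordiv(n,X,seq).
inline std::vector<unsigned long> divisors_of(unsigned long n)
{
    std::vector<unsigned long> d;
    for (unsigned long x = 1; x <= n; ++x)
        if ( n % x == 0 )  d.push_back(x);
    return d;
}

// Primes p with a <= p <= b, found by trial division:
// candidate values for X in forprime(X=a,b,seq).
inline std::vector<unsigned long> primes_between(unsigned long a, unsigned long b)
{
    std::vector<unsigned long> p;
    for (unsigned long x = (a < 2 ? 2 : a); x <= b; ++x)
    {
        bool is_prime = true;
        for (unsigned long f = 2; f * f <= x; ++f)
            if ( x % f == 0 )  { is_prime = false;  break; }
        if ( is_prime )  p.push_back(x);
    }
    return p;
}
```

A range-based for loop over divisors_of(12) then visits 1, 2, 3, 4, 6, 12.
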
The expression seq is a list of statements: 


for ( k=1, 10, stat1; stat2; stat3; ) /* last semicolon optional */
for ( k=1, 10, stat1; )
for ( k=1, 10, ; ) /* zero statements (do nothing, ten times) */


(The comments enclosed in /* */ were added manually.) 

The loop-variable is local to the loop: 

? for(k=1,10, ; ) /* do nothing, ten times */
? k /* not initialized in global scope ==> returned as symbol */
k


A global variable of the same name is not changed: 


? k=7
? for(k=1,3, print(" k=",k))
k=1
k=2
k=3
? k
7 /* global variable k not modified */


For the sake of clarity, avoid using global and loop-local variables of the same name. 


A loop can be aborted with the statement break(). The n enclosing loops are aborted by break(n).
With next(), the next iteration of a loop is started (and the statements until the end of the loop are
skipped). With next(n) the same is done for the n-th enclosing loop.
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And yes, there is an if statement: 


if(a,seq1,seq2): if a is nonzero, seq1 is evaluated, otherwise seq2. seq1 and seq2
are optional, and if seq2 is omitted, the preceding comma can be omitted also.


To have more than one statement in the branches use semicolons between the statements: 
if ( a==3, /* then */
    b=b+1;
    c=7;
, /* else */
    b=1;
);


Non-interactive usage (scripts) 


Usually one will create scripts that are fed into gp (at the command line): 
gp -q < myscript.gp

The option -q suppresses the startup message and the history numbers %N. 
If the script contains just the line 

exp(2.0) 


the output would be 
7.3890560989306502272304274605750078131 


To also see the commands in the output, add the line default(echo,1); to the top of the file. Then the output
is
? exp(2.0) 
7.3890560989306502272304274605750078131 
You should use comments in your scripts, there are two types of them: 


\\ a line comment, started with backslashes 
/* a block comment 
can stretch over several lines, as in the C-language */ 


Comments are not visible in the output. With the script 


default(echo, 1); 
\\ sum of square numbers: 
s=0; for (k=1, 10, s=s+k*k); s


the output would be 


? default(echo,1); 
? s=0;for(k=1,10,s=s+k*k);s
385 


Note that all blanks are removed (on input) and are therefore missing in the output. 


A command can be broken into several lines if it is enclosed within a pair of braces: 
{ for (k=1, 10,
    s=s+k*k;
    print(k,": s=", s);
); }
This is equivalent to the one-liner
for (k=1, 10, s=s+k*k; print(k,": s=", s); );


User-defined Functions 


Now we define a function: 


powsum(n, p)=
{ /* return the sum 1^p+2^p+3^p+...+n^p */
    local(t);
    t=0;
    for (k=1, n,
        t = t+k^p); \\ '^' is the powering operator
    return( t );
}


The statement local(t); makes sure that no global variable named t (if it exists) would be changed 
by the function. It must be the first statement in the function. The variable k in the for()-loop is 
automatically local and should not be listed with the locals. Note that each statement is terminated with 
a semicolon. The output would be 


? powsum(n,p)=local(t);t=0;for(k=1,n,t=t+k^p);return(t);


? powsum(10,2) 
385 
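
For readers more familiar with C++, the GP function powsum() corresponds roughly to the following sketch. The variable t is local simply by being declared inside the function. The result type is a 64-bit integer, so this version overflows for large arguments, which the GP version does not.

```cpp
// C++ analogue of the GP function powsum(n, p):
// returns the sum 1^p + 2^p + 3^p + ... + n^p.
inline unsigned long long powsum(unsigned long n, unsigned p)
{
    unsigned long long t = 0;  // local to the function, like local(t) in GP
    for (unsigned long k = 1; k <= n; ++k)
    {
        unsigned long long kp = 1;  // k^p by repeated multiplication
        for (unsigned j = 0; j < p; ++j)  kp *= k;
        t += kp;
    }
    return t;
}
```

As in the GP session, powsum(10, 2) returns 385.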


Note how the function definition is changed to a one-liner in the output. 
If you have to use global variables, list them at the beginning of your script as follows: 
global(var1, var2, var3);


Any attempt to use the listed names as names of function arguments or local variables in functions will 
trigger an error. Note the use of global() is deprecated in later versions of GP. 


Arguments are passed by value. There is no mechanism for passing by reference, global variables can be 
a workaround for this. 

Arguments can have defaults, as in 

powsum(n, p=2)= /* etc */ 


Calling the function as either powsum(9) or powsum(9,) would compute the sum of the first 9 squares.
Defaults can appear anywhere in the argument list, as in


abcsum(a, b=3, c)= return( a+b+c );


So abcsum(1,,1) would return 5. 
All arguments are implicitly given the default zero, so the sequence of statements 


foo(a, b, c)= print(a,":",b,":",c);
foo(,,)
foo()
foo


will print three times 0:0:0. This feature is rarely useful and does lead to obscure errors. It will hopefully 
be removed in future versions of GP. 
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Beatty sequence 
Bell numbers [151] [358]
Bell polynomials
Ben-Or test for irreducibility
Berlekamp's Q-matrix
Bhaskara equation [812]
big endian machine
binary exponentiation
binary finite field [804]
binary GCD algorithm [767]
binary heap
binary polynomial
binary powering
binary relation
binary search
binary splitting
— for rational series
— vs. AGM, [656]
— with continued fractions, [720]
binary_debruijn (C++ class)
binary_necklace (C++ class) [373]
Binet form, of a recurrence [674] [674]
binomial coefficient
— and type-2 ONB
— modulo a prime
— number of combinations
bit combinations
bit counting
bit subsets, via sparse counting [68]
bit-array [164]
bit-array, fitting in a word [24]
bit-block boundaries, determination [12]
bit-reversal [34]
bit-reversal permutation [118]
bit-wise
— reversal
— rotation
— zip
bit_fibgray (C++ class)
bit_necklace (C++ class)
bit_subset (C++ class) [68]
bit_subset_gray (C++ class) [69] [69]
bitarray (C++ class) [164] [164]
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bitrev permutation [118] 

BITS_PER_LONG
bitset, testing for subset
blocks of bits, counting
blocks of bits, creation
blocks, swapping via quadruple reversal [124]
blue code [49] [377]
blue code, fixed points [53]
bracelets, as equivalence classes
branches, avoiding them [25]
bsearch [41]
bswap instruction
built-ins, GCC
butterfly diagram, for radix-2 transforms [460]
byte-wise Gray code and parity
BYTES_PER_LONG

C++ class XYZ see XYZ (C++ class)
C2RFT see real FFT
C2RFT (complex to real FFT)
carries with mixed radix counting [220]
carry, in multiplication
CAT, constant amortized time [173]
catalan (C++ class)
Catalan constant
Catalan numbers
Catalan objects (combinatorial structures counted by Catalan numbers) [323]
Cayley numbers
Cayley-Dickson construction
characteristic polynomial
— of a matrix
— of a recurrence relation, [666]
— with Fourier transform
characteristic, of a field [886]
Chase’s sequence, for combinations [190]
Chebyshev polynomials
— (definition), [676]
— and Pell’s equation
— and products for the square root
— and recurrence for subsequences
— and square root approximants
— as hypergeometric functions
— fast computation, [680]
— with accelerated summation
checking pair, of arctan relations
Chinese Remainder Theorem (CRT)
Chinese Remainder Theorem, for convolution [542]
chirp z-transform
circuit in a graph
circulant matrix [447] [905]
Clausen's product
CLHCA (cyclic LHCA)
clz (Count Leading Zeros), GCC built-in [21]
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co-lexicographic order 

— (definition), [172]
— for combinations, [177]
— for compositions
— for permutations
— for subsets of a multiset
— with bit combinations
colex (co-lexicographic) order
comb_rec (C++ class)
combination_chase (C++ class)
combination_colex (C++ class)
combination_emk (C++ class)
combination_enup (C++ class)
combination_lex (C++ class)
combination_mod (C++ class)
combination_pref (C++ class)
combination_revdoor (C++ class)
combinations (k-subsets) of a multiset [296]
combinations, Gray code, with binary words [66]
combinations, of k bits
combinatorial Gray code
companion matrix
comparison function, for sorting [145]
compiler, smarter than you thought
complement, of a permutation
complement-shift sequences
complementary (dual) basis
complementing the sequency of a binary word [48]
complete elliptic integral see elliptic integral
complete graph
complex numbers, construction [804]
complex numbers, mult. via 3 real mult.
complex numbers, sorting [146]
composite modulus
compositeness of an integer, test for [786]
composition of permutations [108]
composition, of permutations [105]
composition_colex (C++ class)
composition_colex2 (C++ class)
composition_ex_colex (C++ class)
compositions [194]
computation of π, AGM vs. binary splitting [656]
conditional search, for paths in a graph [398]
conference matrix [386]
conjugates of an element in GF(2^n) [892]
connected permutation [103] [281]
connected permutation, random [117]
connection polynomial [864]
constant
— Catalan, [663]
— CORDIC scaling, [647] [650]
— Fibonacci parity, [755]
— Golay-Rudin-Shapiro, [732]
— Gray code, [7
— GRS, [732]
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— Komornik-Loreti, 
— paper-folding
— parity number
— Pell, [758]
— Pell Gray code
— Pell palindromic
— period-doubling
— rabbit
— revbin
— Roth's
— ruler
— sum of Gray code digits, [744]
— sum-of-digits, [740]
— Thue
— weighted sum of Gray code digits
— weighted sum-of-digits
constant amortized time (CAT)
contiguous relations, for hypergeometric series [689]
continued fraction [716]
continued fractions, as matrix products [720]
convergent, of a continued fraction [716]
conversion, float to int [6]
conversion, of the radix (base)
convolution
— acyclic (linear), [443]
— and Chinese Remainder Theorem, [542]
— and circulant matrices
— and multiplication
— AND-convolution
— by FFT, without revbin permutations, [442]
— by FHT
— cyclic
— cyclic, by FHT, [525]
— dyadic
— exact, [542]
— ms
— mass storage, [153]
— MAX-convolution
— negacyclic
— OR-convolution
— property, of the Fourier transform, [441]
— right-angle
— skew circular
— weighted
— XOR-convolution, [481]
cool-lex order see prefix shifts order
Cooley-Tukey FFT algorithm [412]
coprime: a coprime to b <==> gcd(a,b) == 1 [535]
copying one bit [7]
CORDIC algorithms
correlation [444]
— and circulant matrices, [447]
cosine and cosh, as hypergeometric function
cosine, by rectangular scheme
cosine, CORDIC algorithm
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cosine, in a finite field 
counting bits of a sparse word [20]
counting bits of a word [18]
counting sort
coupled iteration, for the square root
CPU instructions, often missed
CRC (cyclic redundancy check)
crc32 (C++ class)
crc64 (C++ class)
Creutzburg-Tasche primitive root
cross-correlation [445]
CRT (Chinese Remainder Theorem) [772]
ctz (Count Trailing Zeros), GCC built-in [21]
cube root extraction [569]
cubic convergence
cycle in a graph [391]
cycle type, of a permutation [278]
cycle-leaders
— of the Gray permutation
cycles, of a permutation [104]
cyclic convolution
— (definition), [440]
— computation via FFT, [442]
cyclic correlation [444]
cyclic distance, with binary words [32]
cyclic group
cyclic LHCA (CLHCA)
cyclic period, of a binary word [30]
cyclic permutation
cyclic permutation, random [112]
cyclic permutations
cyclic redundancy check (CRC)
cyclic XOR [32]
cyclic_perm (C++ class)
cyclotomic polynomials
— (definition), [704]
— and primes, [802]
— and primitive binary polynomials

Daubechies wavelets [547]
De Bruijn graph [395]
De Bruijn sequence
— (definition), [873]
— for computing bit position
— lex-min DBS via necklaces, [377]
— number of
— with subsets
De Bruijn sequence
— as path in a graph
debruijn (C++ class)
decimation in frequency (DIF) FFT [414]
decimation in time (DIT) FFT [412]
delta sequence [447]
delta set
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delta squared process [598] 
demo-programs, and timing [175]
deque (C++ class)
deque (double-ended queue) [158]
derangement
derangement order, for permutations [264]
derangement, random
DFT (discrete Fourier transform) [410]
DIF (decimation in frequency) FFT [414]
difference sets, and correlation [447]
digraph
digraph (C++ class)
digraph_paths (C++ class)
directed graph [391]
discrete Fourier transform (DFT) [410]
discrete Hartley transform [515]
DIT (decimation in time) FFT
divides: d\n means “d divides n”
division
— algorithm using only multiplication, [567]
— CORDIC algorithm
— exact, by C = 2^k ± 1
— exact, with polynomials over GF(2), [826]
divisionless iterations for polynomial roots [586]
divisors (C++ class)
divisors of n, sum of e-th powers, σ_e(n) [708]
Dobinski’s formula, for Bell numbers
double buffer, for mass storage convolution [454]
double-ended queue (deque) [158]
dragon curve sequence [90]
dual basis
dyadic convolution [481]
Dyck words
— k-ary, [333]
— binary, [323]
dyck_gray (C++ class)
dyck_gray2 (C++ class)
dyck_rgs (C++ class)

E, elliptic integral [603]
Eades-McKay sequence, for combinations [183]
easy case, with combinatorial generation
edge of a graph [391]
edge sorting, with graph search
EGCD (extended GCD)
EGCD, to compute Padé approximant
EGF (exponential generating function)
eigenvectors of the Fourier transform
eight-square identity
element of order n
elementary functions, as hypergeometric f.
elliptic E (complete elliptic integral) [603]
elliptic K (complete elliptic integral) [601]
elliptic integrals, as hypergeometric functions [700]
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endian-ness, of a computer [2] 
endo (Even Numbers DOwn) order
endo order, for mixed radix numbers
enup (Even Numbers UP) order
enup order for combinations
enup order, with permutations
equivalence classes [148]
equivalence relation [148]
equivalence relations, number of
Eratosthenes, prime sieve
eta function (η-function) [344]
Euclidean algorithm [767]
Euler numbers [282]
Euler’s identity, for hypergeometric functions [689]
Euler's totient function [776]
exact convolution [542]
exact division
exact division, by C = 2^k ± 1
exact division, with polynomials over GF(2) [826]
expect (with branch prediction), GCC built-in
exponent, of a group
exponential convergence
exponential function
— as hypergeometric function, [696]
— bit-wise computation
— by rectangular scheme, [660]
— computation via q = exp(−π K'/K)
— iteration for
— of power series
exponential generating function (EGF)
exponentiation
— algorithms
— modulo m, [766]
extended GCD (EGCD)
extended GCD, to compute Padé approximant [595]
extension field [804]
external algorithms
extraneous fixed point, of an iteration [593]

factorial number system [232]
factorial numbers, and cyclic permutations [289]
factorial, binsplit algorithm for
factorial, rising factorial power [685]
factorization of binary polynomials [858]
falling factorial [176]
falling factorial base [232]
fast Fourier transform (FFT) [411]
fast Hartley transform (FHT) [515]
fcsr (C++ class)
FCSR (feedback carry shift register) [876]
feedback carry shift register (FCSR)
Fermat numbers [795]
Fermat primes
Ferrers diagram (with integer partitions) [345]
ffs (Find First Set), GCC built-in [21]
FFT
— as polynomial evaluation, [559]
— radix-2 DIF, [416]
— radix-2 DIT
— radix-4 DIF
— radix-4 DIT
— split-radix algorithm
FFT (fast Fourier transform) [411]
FFT caching [564]
FFT, for multiplication [558]
FFT-primes
FHT
— convolution by
— DIF step, [519]
— DIF, recursive
— DIT, recursive
— radix-2 DIF
— radix-2 DIT
— radix-2 DIT step, [516]
— shift operator
FHT (fast Hartley transform) [515]
Fibbinary numbers
Fibonacci
— k-step sequence, [309]
— parity, [754]
— parity constant
— polynomials
— representation
— setup, of a shift register, [867]
— words
— words, Gray code, [306]
— words, shifts-order
Fibonacci numbers [754]
Fibonacci-Haar transform [512]
Fibonacci-Walsh transform [513]
field polynomial [886]
FIFO (first-in, first-out), queue
filter, for wavelet transforms
finite field
finite field, binary [886]
finite field, with prime modulus [776]
Fisher-Yates shuffle
fixed point, of a function [587]
fixed point, of an iteration, extraneous
fixed points, of the blue code [53]
FKM algorithm [371]
FKM algorithm, for binary words [30]
four step FFT [438]
four-square identity
Fourier shift operator
Fourier transform
Fourier transform, convolution property [441]
fractional (order) Fourier transform [533]
fractional Fourier transform [456]
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free element (normal element) |901 
full path in a graph 


Galois field 
Galois setup, of a shift register [867] 
Gauss’ transformation [690] 
Gaussian normal basis [914 
GCC, built-ins 
GCD (greatest common divisor), computation[767] 
generalized subset-lex (gslex) order [224] 
generating functions, for combinatorial objects[173] 
generator in GF(2") [889 
generator of a group 
generator, modulo p 
generator, program producing programs [531] 
GF(2") (binary finite field) |886 
GF2n (C++ class) 
GNB (Gaussian normal basis) 
Golay-Rudin-Shapiro constant 
Golay-Rudin-Shapiro sequence |44| 
Goldschmidt algorithm [581] 
Gray code 

— and radix —2 representations, 

— binary, reversed, 

— combinatorial (minimal-change order), 

— constant, 

— for combinations, 

— for combinations of a binary word, 

— for Fibonacci words, [76] 

— for Lyndon words, 

— for mixed radix numbers, 

— for multiset permutations, |301 

— for Pell words, 

— for sparse signed binary words, [315 

— for subsets of a bitset, 

— for subsets, with shifts-order, 

— of a binary word, 

— powers of, 

— single track, 
Gray permutation 
gray cycle leaders (C++ class) 
greatest common divisor (GCD), computation[767] 
green code 
ground field [804] 
group, cyclic 
GRS (Golay-Rudin-Shapiro) sequence [44] 
GRS constant 
gslex order, for mixed radix numbers 


Haar transform [497] 

Hadamard matrix 817 
Hadamard transform 

half-trace, in GF(2^n) with n odd [898]
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Halley's formula |588| 
Hamiltonian cycle 
Hanoi, towers of, puzzle [735] 
Hartley shift 
Hartley transform 
hashing, via CRC 
heap [160] 
Heap's algorithm for permutations [248] 
heapsort 
Heighway dragon 
hexdragon [95] 
high bits of a word, operations on [14] 
Hilbert curve 
— function encoding it, 
— moves, 
— turns, 
homogeneous moves, with combinations [183] 
homogeneous moves, with k-subsets [215]
Householder's iteration [592]
Householder’s method 
hybrid linear cellular automaton (LHCA) 
hyperbolic sine and cosine, by CORDIC 
hypercomplex numbers 
hypergeometric function 
— (definition), [685]
— AGM algorithms,
hypergeometric series
— (definition), [685]
— conversion to continued fraction,
— transformations, [688]


identical permutation [102] 
i^i = exp(−π/2), computation [627]
in-place routine [413] 
indecomposable permutation 
indecomposable permutation, random [117] 
index of an element modulo m [175]
index of the single set bit in a word [13]
index sort
infinite products, from series [709]
inhomogeneous recurrence
initial approximations, for iterations [575]
integer partitions [339]
integer sequence
— 1’s-counting seq.,
— Beatty seq. with d,
— Bell numbers, [358]
— Carmichael numbers,
— Catalan numbers, [331]
— connected permutations,
— dragon curve seq., [90]
— Euler function φ(n), [776]


— Euler numbers, 
— F-increment RGS,
— Feigenbaum symbolic seq.,
— Fibbinary numbers,
— Fibonacci numbers,
— fixed points in lex-rev seq.,
— Golay-Rudin-Shapiro seq., [731]
— Gray codes,
— GRS (Golay-Rudin-Shapiro) seq.,
— hypercomplex multiplication,
— indecomposable permutations, [281]
— integer partitions, [345]
— involutions, [279]
— irreducible polynomials,
— irreducible self-reciprocal polynomial,
— irreducible trinomials,
— K-increment RGS,
— Kronecker symbol, [745]
— Lyndon words,
— max-increment RGS,
— Mephisto Waltz seq.,
— Moser-De Bruijn sequence,
— necklaces,
— non-generous primes, [780]
— number of XYZ, see number of, XYZ
— optimal normal bases, type-1,
— optimal normal bases, type-2,
— paper-folding seq.,
— paper-folding seq., signed,
— paren words,
— partitions into distinct parts,
— partitions, of an integer,
— Pell equation not solvable,
— period-doubling seq.,
— primes with primitive root 2,
— primitive roots of Mersenne primes, [373]
— primitive trinomials,
— quadratic residues all non-prime,
— rabbit seq.,
— radix −2 representations, [60]
— restricted growth strings, [337]
— ruler function, [733]
— sparse signed binary words,
— subfactorial numbers, [280]
— subset-lex words,
— sum of binary digits, [739]
— sum of digits of binary Gray code,
— swaps with revbin permutation,
— Thue-Morse seq., [726]
— trinomial, irreducible,
— type-1 optimal normal bases,
— type-2 optimal normal bases,
— values of the Möbius function,
— Wieferich primes,
integer sequence, by OEIS number
— A000003,
— A000005,
— A000009,
— A000010,
— A000011,
— A000013, [408]
— A000029, [151]
— A000031, [380]
— A000041,
— A000043,
— A000045,
— A000048,
— A000073,
— A000078,
— A000079,
— A000085,
— A000108,
— A000110, [358]
— A000111,
— A000120, [739]
— A000123,
— A000129,
— A000166,
— A000201,
— A000203,
— A000213,
— A000255,
— A000288,
— A000296,
— A000322,
— A000383,
— A000593,
— A000695, [59] [750]
— A000700,
— A001037, [843]
— A001045, [318]
— A001122,
— A001220,
— A001227,
— A001262,
— A001318,
— A001333, [758]
— A001470,
— A001511, [9] [733]
— A001591,
— A001592,
— A002450,
— A002475,
— A002812,
— A002997,
— A003010,
— A003106,
— A003114, [350]
— A003188,
— A003319,
— A003462,
— A003622,
— A003688,
— A003714, [756]
— A003849,
— A004211,
— A004212,
— A004213,
— A005011,
— A005351, [60]
— A005352,
— A005418, [733]
— A005578,
— A005614,
— A005727,
— A005797,
— A005811, [90] [744]
— A006130,
— A006131,
— A006206,
— A006498,
— A006519,
— A006945,
— A007895,
— A008275,
— A008277,
— A008683,
— A010060, [24] [727]
— A011260,
— A014565,
— A014577,
— A014578,
— A015440,
— A015441,
— A015442,
— A015443,
— A015448,
— A015449,
— A016031,
— A019320,
— A020229,
— A020985,
— A022155,
— A022342,
— A025157,
— A025158,
— A025159,
— A025160,
— A025161,
— A025162,
— A027187,
— A027193,
— A027362,
— A028859,
— A029883,
— A031399,
— A034448,
— A034947,
— A035263, [10] [735]
— A035457,
— A036991,
— A045687,
— A046116,
— A046699,
— A048146, [353]
— A048250,
— A051158,
— A054639, [914]
— A055578,
— A055881,
— A057460,
— A057461,
— A057463,
— A057474,
— A057496,
— A061344,
— A064990,
— A065428,
— A067659,
— A067661,
— A069925,
— A071642, [912] [914]
— A072226, [802]
— A072276,
— A073571,
— A073576,
— A073639,
— A073726, [885]
— A074071,
— A074710,
— A079471,
— A079559,
— A079972,
— A080337,
— A080764,
— A080846,
— A086347,
— A087188,
— A091072,
— A095076,
— A096393,
— A100661,
— A101284,
— A102467,
— A103314,
— A104521, [759]
— A106400,
— A106665, [90]
— A107220,
— A107222,
— A107877,
— A108918,
— A110981,
— A114374,
— A118666,
— A118685,
— A123400,
— A125145,
— A134337,
— A134345,
— A135488, [909]
— A135498,
— A136250,
— A136415,
— A136416,
— A137310,
— A137311,
— A137313,
— A137314,
— A138933,
— A143347,
— A159880,
— A162296,
— A164896,
— A167654,
— A175337,
— A175338,
— A175339,
— A175390,
— A176405,
— A176416,
interleaving process
— for set partitions, [354]
— for Trotter's permutations,
interpolation binary search
interpolation, linear [142]
inverse
— additive, modulo m, [767]
— by exponentiation,
— cube root, iteration for, [569]
— in GF(Q),
— iteration for,
— modulo 2^n (2-adic),
— modulo m, by exponentiation, [781]
— multiplicative, modulo m,
— of a circulant matrix,
— permutation, [106]
— permutation, in-place computation, [106]
— power series over GF(2), [826]
— root, iteration for, [573]
— square root, iteration for, [568]
— XYZ transform, see XYZ transform
inversion formula, Lagrange [590]
inversion principle, Möbius
inversion table, of a permutation [232]
invertible modulo m
involution (self-inverse permutation) [106] [279]
involution, random
irreducible polynomial [837]
isolated ones or zeros in a word [11]
iteration
— and multiple roots, [593]
— divisionless, for polynomial roots, [586]
— for exp,
— for inverse cube root,
— for inverse root,
— for inverse square root,
— for logarithm,
— for roots modulo p^n, [569]
— for the zero of a function, [587]
— Goldschmidt,
— Householder’s,
— Schröder's,
— synthetic,
— to compute π, [615]
Ives' algorithm for permutation generation [270]
Jacobi matrix [548]
Jacobi's identity [605]
K, elliptic integral [601]
k-ary Dyck words
k-compositions of n [94]
k-permutations [291]
k-subset [210]
k-subsets (combinations) of a multiset [296]
Karatsuba multiplication
— for integers, [550]
keys, sorting by keys
Knuth shuffle
Komornik-Loreti constant
König iteration functions
kperm_gray (C++ class)
Kronecker product
— (definition), [463]
— of Hadamard matrices,
Kronecker symbol
ksubset_gray (C++ class)
ksubset_rec (C++ class)
ksubset_twoclose (C++ class)
Kummer’s transformation [693]
Lagrange inversion formula
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Lambert series 


LCM, least common multiple 


left inversion, of a permutation 
left-right array [166] 
left-right array, with Lehmer code [235] 
left-to-right powering [564]
left_right_array (C++ class)
Legendre symbol
Legendre's relation [604]
Lehmer code, of a permutation [232]
lex (lexicographic) order [172] 
lexicographic order 

— (definition), [172]

— for bit combinations, 

— for combinations, 

— for multiset permutations, 

— for subsets, [202]

— for subsets of a binary word, 

— generalized, for mixed radix numbers, 
lfsr (C++ class) 
LFSR (linear feedback shift register) [864]
LFSR, and Hadamard matrices [384]
LHCA (linear hybrid cellular automaton) [878]
LIFO (last-in, first-out), stack [153]
LIMB (super-digit)
linear convolution
linear correlation [444]
linear feedback shift register (LFSR)
linear hybrid cellular automaton (LHCA)
linear interpolation [142]
linear, function in a finite field [887]
Lipski's Gray codes for permutations
list recursions [304]
little endian machine [2]
localized Hartley transform algorithm [529]
localized Walsh transform algorithm 
logarithm 

— as hypergeometric function, [696]

— bit-wise computation, 

— computation by rectangular scheme, 


— computation via AGM, 

— computation via π/log(q),

— curious series for, 

— iteration using exp, [623]

— of power series, [630] 
logarithmic generating function (LGF) 
logical shift, of a binary word [3] 
long division [567] 
long multiplication 
loop in a graph 
loopless algorithm 
low bits, operations on [8] 


LR-array (left-right array) [166]
Lucas test, for primality [799]
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Lucas-Lehmer test, for Mersenne numbers [796] 
lucky path, in a graph [402] 
Lunnon's Gray code for multiset permutations [301]
Lyndon words 
— (definition), [370]
— and irreducible binary polynomials, [856]
— and Mersenne primes,
— binary, number of, [380]
— number of, [380]
— with even/odd weight,
— with fixed content,
— with fixed density, [382]


Lyndon words, as binary word 
lyndon_gray (C++ class) 


m-sequence 
MAC (modular adjacent changes), Gray code [399] 
mass storage convolution [453] 
matrix Fourier algorithm (MFA) 
matrix square root, applications [576]
matrix transposition, and zip permutation [127] 
matrix transposition, in-place [122] 
MAX-convolution [491] 
maximal order modulo m [774] 
mean, prere 
median of three elements 
Mephisto Waltz sequence [730] 
merge sort 
Mersenne numbers, Lucas-Lehmer test [796]
Mersenne primes 

— 23-th roots, 

— and Lyndon words, 

— generalized, 

— Lucas-Lehmer test,
Mersenne-Walsh transform [513]
MFA (matrix Fourier algorithm) [438]
MFA convolution algorithm 
minimal polynomial, in erm 
minimal-change order see Gray code 
minimum, among bit-wise rotations 
missing, CPU instructions [82] 
mixed radix numbers [217] 
mixedradix_endo (C++ class)
mixedradix_endo_gray (C++ class)
mixedradix_gray (C++ class)
mixedradix_gslex (C++ class)
mixedradix_gslex_alt (C++ class)
mixedradix_lex (C++ class)
mixedradix_modular_gray (C++ class)
mixedradix_modular_gray2 (C++ class)
mixedradix_sod_lex (C++ class)
Möbius function [705]
Möbius inversion principle [706]


mod (C++ class) 




mod m FFTs [535]
modular adjacent changes (MAC), Gray code [399]
modular arithmetic [764]
modular multiplication 
modular reduction, with structured primes [768] 
modular square root 
modulo, as equivalence classes [149] 
modulus 
— composite, [776]
— prime,
— prime, with NTTs, [535]
moment conditions, for wavelet filters [546]
Moser-De Bruijn sequence
moves, of the Hilbert curve 
mpartition (C++ class) 
mset_perm_gray (C++ class)
mset_perm_lex (C++ class)
mset_perm_lex_rec (C++ class)
multi-dimensional Walsh transform
multi-point iteration [597]
multigraph [391]
multinomial coefficient 
multiple roots, iterations for 
multiplication 
— by FFT, [558]
— carry, [558]
— integer vs. float, [6]
— is convolution,
— Karatsuba,
— modulo m, [765]
— of complex numbers via 3 real mult., 
— of hypercomplex numbers, 
— of octonions, 
— of polynomials, is linear convolution, [444] 
— of quaternions, 
— sum-of-digits test, 
multiplication matrix, for normal bases [901] 
multiplication table, of an algebra [815] 
multiplicative function 
multiplicative group [777]
multiplicative group, with a ring [775]
multiplicative inverse, modulo m [767]
multisection of power series [688] 
multiset 


N-polynomial (normal polynomial) 
n-set, a set with n elements 

NAF (nonadjacent form)

NAF, Gray code [315] 

necklace (C++ class) 
necklace2bitpol (C++ class) 
necklaces 


— as equivalence classes, [149]
— binary, 
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— binary, number of, 
— definition, [370] 

— with even/odd weight,
— with fixed content, 


neighbors of a node in a graph 
Newton’s formula [895]
Newton’s iteration, for vector-valued functions [548]
node (vertex) of a graph 
non-generous primes 
non-residue (quadratic, modulo p) [781]
nonadjacent form (NAF) 
nonadjacent form (NAF), Gray code [315] 
normal bases, for GF(2^n) [900]
normal basis, optimal
normal element (free element) [901]
normal polynomial [900]
NTT
— (number theoretic transforms), [535]
— radix-2 DIF, [538]
— radix-2 DIT, [537]
— radix-4, [540]
number of 
— alternating permutations, [281] 
— aperiodic necklaces, [380]
— binary necklaces,
— binary partitions of even numbers,
— binary reversible strings,
— binary words with at most r consecutive ones, [308]
— bracelets,
— carries, [220]
— combinations,
— connected permutations,
— cycles in De Bruijn graph, [397]
— De Bruijn sequences, [874]
— derangements, [280] 
— divisors, 
— equivalence relations, 
— F-increment RGS, 
— fixed density Lyndon words, 
— fixed density necklaces, 
— generators modulo n, 
— increment-i RGS,
— indecomposable permutations, [281]
— integer partitions, [345]
— integers coprime to m, [776]
— inversions of a permutation,
— invertible circulant matrices, [905]
— involutions, [279]
— irreducible polynomials,
— irreducible SRPs,
— K-increment RGS,
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- Lyndon words, |380| 


— m-sequences, [873]
— max-increment RGS,
— necklaces, [150] [379]
— necklaces with even/odd weight, [382]
— normal polynomials, [904]
— ones in binary Gray code,
— parentheses pairs, [331]

— partitions into distinct parts, 

— partitions of a set, [358]

— partitions of an integer, 

— permutations of a multiset, 

— permutations whose order divides 3, [280]
— permutations with m cycles, 

— primitive normal polynomials, 

— primitive polynomials, 

— primitive SRPs, 

— self-dual normal bases, 

— set partitions, 
— shift register sequences, 
— sparse signed binary words, 
— strings with fixed content, 
— swaps with revbin permutation, [118] 
— units in GF(Q), [888]
— units modulo m,
— unlabeled bracelets, 
— unlabeled necklaces, 


— zero-divisors of Cayley-Dickson algebras, [820]


number theoretic transforms (NTT) [535]


O(1) algorithm 
octonions [816]
OGF (ordinary generating function) [173]
ONB (optimal normal basis) [912]
one-point iteration [587] 
optimal normal basis (ONB) 
optimization, with combinatorial generation [174] 
OR-convolution [489]
OR-convolution, weighted 
order 

— of a polynomial, 

— of an element modulo m, 

— of an iteration, [587] 
ordinary generating function (OGF) 
out of core algorithms [453] 


P rn tati 
Padé approximants 
— for arctan, 


— for exp, 
— for the r-th root, 
— for the logarithm, 
— for the square root, 




paper-folding sequence 
parallel assignment (with pseudocode) 
parameters, of a hypergeometric series [685]
paren (C++ class)
paren_gray (C++ class)
parentheses, and binary words [78] 
parity 

— number, 

— of a binary word, 

— of a permutation, [105]

— random permutation with given parity, [112]
parity (parity of a word), GCC built-in [21] 
Parseval's equation [411] 
partial unrolling of a loop 
partition 

— of a set, 

— of an integer, [339]
partition (C++ class) 


partition_gen (C++ class)
partitioning, for quicksort 
Pascal's triangle 

path in a graph

pattern, length-n with p letters [861] 
pcrc64 (C++ class)


Pell 
— constant, [758] 
— equation, 
— Gray code constant, [762]
— palindromic constant, [759]
Pell ruler function [760]


Pell words, Gray code [313] 


pentagonal number theorem [346]
pentanomial 

Pepin's test, for Fermat numbers [795] 
period of a polynomial 
period-doubling constant 
period-doubling sequence [10] [735]
perm_colex (C++ class)
perm_derange (C++ class)
perm_gray_ffact (C++ class) [259]
perm_gray_ffact2 (C++ class)
perm_gray_lipski (C++ class)
perm_gray_rfact (C++ class)
perm_gray_rot1 (C++ class)
perm_gray_wells (C++ class)
perm_heap (C++ class)
perm_heap2 (C++ class)
perm_heap2_swaps (C++ class)
perm_involution (C++ class)
perm_ives (C++ class)
perm_lex (C++ class)
perm_mv0 (C++ class)
perm_rec (C++ class)
perm_restrpref (C++ class) [278]
perm_rev (C++ class) [245]




perm_rev2 (C++ class) 
perm_rot (C++ class)
perm_st (C++ class) 
perm_st_gray (C++ class) 
perm_star (C++ class) 
perm_star_swaps (C++ class) 
perm_trotter (C++ class) 
perm_trotter_lg (C++ class) 
permutation 
— alternating, [281]
— as path in the complete graph,
— composition,
— connected,
— cycle type,
— cycles, [104]
— cyclic, random,
— derangement,
— indecomposable,
— inverse of,
— involution,
— of a multiset,
— random, [111]
— with m cycles, number of,
— with prescribed parity, random, [112]
Pfaff's reflection law [689]
phi function, number theoretic [776]
π, computation [615]
pitfall, shifts in C 
pitfall, two's complement [4] 
Pocklington-Lehmer test, for primality [794]
pointer sort 
pointer, MET 
polar decomposition, of a matrix [578] 
polynomial 
— binary, [822]
— irreducible, 
— multiplication, is linear convolution, 
— multiplication, splitting schemes, 
— primitive, 
— roots, divisionless iterations for, 
— weight of a binary polynomial, 
popcount (bit-count), GCC built-in 
power series 
— computation of exponential function, [631]
— computation of logarithm, [630]
— reversion, [589]
powering 
— left-to-right method, 
— modulo m, 
— of permutations, [108]
— of the binary Gray code, 
— right-to-left method, 
Pratt’s certificate of primality 
prefetching a memory location, GCC built-in [21] 
prefix convolution 
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prefix shifts, order for 

— combinations, 

— multiset permutations, 

— paren strings, [330]
prefix transform 
prime length FFT, Rader's algorithm [457] 
primes 

— and cyclotomic polynomials, [802] 

— as modulus, [776]

— as modulus, with NTTs, [535]

— non-generous, [780]

— sieve of Eratosthenes, 

— structured, 

— Wieferich, 

— with primitive root 2, 
primitive n-th root, modulo m 
primitive r-th root of unity 
primitive elements, of a group 
primitive polynomial [841] 
primitive root 

— (definition), [776]

— Creutzburg-Tasche, [808]

— in GF(2^n), [889]

— of Mersenne primes,

— with prime length FFT,
primitive trinomial [848] 
priority queue 
priority_queue (C++ class) 
product form 


— for a-th root, 
— for continued fractions, 


— for elliptic K, 


— for power series of exp, [631]

— for square root, 

— from series, 
products of k out of n factors [179]
products, infinite, from series [709]


Proth's theorem [795]
Prouhet-Thue-Morse constant 


pseudo graph 
pseudo-inverse, of a matrix 
pseudoprime 


pseudoprime, strong (SPP) 


Q-matrix [858] 

quadratic convergence [587] 

quadratic equation, with binary finite fields [896] 
quadratic reciprocity [782]

quadratic residue (square) modulo p [781]
quadratic residues, and Hadamard matrices [386] 
quadruple reversal technique [124] 

quartic convergence 


quaternions 
queue (C++ class) 
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queue (FIFO) 
quicksort 


R2CFT, see real FFT
R2CFT (real to complex FFT) [431]
R5-dragon 
R7-dragon 
rabbit constant 
rabbit sequence [513] 
Rabin's test for irreducibility [838] 
Rabin-Miller test, for compositeness [787] 
Rader's algorithm, for prime length FFT [457] 
radix −2 (minus two) representations [58]
radix (base) conversion [656] 
radix permutation 
radix sort 
radix-r DIF FFT step 
radix-r DIT FFT step 
random permutation 
random selection 
ranking, with combinatorial objects [172] 
rational, square root iterations 
re-orthogonalization, of a matrix 
real FFT 

— by FHT, [523]

— split-radix algorithm, 

— with wrap routines, 
reciprocal polynomial [845]
reciprocity, quadratic [782]
rectangular scheme

— for arctan and log, [658]

— for exp, sin, and cos, [660] 
recurrence 


- o Erg 

— inhomogeneous, 

— relation,

— relation, for subsequences, 
red code 
reduction 

— modular, with structured primes, [768]

— modulo z^2 + z + 1 etc., [806]
Reed-Muller transform 

— (definition), 

— and necklaces, 
rejection method 
relation, binary 
representations, radix −2 (minus two)
representatives, with equivalence classes [149]
residue (quadratic, modulo p) [781]
restricted growth strings (RGS)

— (definition), [325]

— for parentheses strings, 


— for set partitions, 
revbin (bit-wise reversal) 




revbin constant 
revbin pairs, via shift registers [873] 
revbin permutation [118] 
revbin permutation, and convolution by FFT [442]
reversal, bit-wise
reversal, of a permutation 
reversed arithmetic transform 
reversed Gray code [45] 
reversed Gray permutation P31 
reversed Haar transform 
reversed Reed-Muller transform [487] 
reversing the bits of a word [34] 
reversion of power series 
— (definition), [589]
— for Schröder's formula,
— with k-ary Dyck words, 
RGS (restricted growth string) 
rgs_fincr (C++ class)
rgs_maxincr (C++ class)
right inversion, of a permutation [232]
right-angle convolution
right-to-left powering
ring buffer 
ringbuffer (C++ class) 
rising factorial base [232] 
RLL (run-length limited) words 
Rogers-Ramanujan identities 
root 
— extraction, [572]
— inverse, iteration for,
— modulo p^n (p-adic),
— of a polynomial, divisionless iterations, [586]
— primitive,
— primitive, in GF(2^n), [889]
— primitive, modulo m, 
— primitive, of Mersenne primes, 
roots of unity, having sum zero [383] 
rotation, bit-wise 
rotation, by triple reversal [123] 
row-column algorithm [437] 
ruler constant 
ruler function [207] 


ruler_func (C++ class) [207]
run-length limited (RLL) words 


Sande-Tukey FFT algorithm 
Sattolo’s algorithm 
scalar multiplication 
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Schróder's formula 

search, binary [141] 

searching, with unsorted arrays 
secant method 

sedenions [816]

selection sort 




selection, of a random element [111]
self-correlation [445]

self-dual (basis over GF(2^n)) [908]
self-inverse permutation, random 
self-reciprocal polynomial 
sentinel element 

sequence, see integer sequence


sequency [474] 


sequency of a binary word, complementing 


sequency, of a binary word [46] 
set partition [354] 
setpart (C++ class) 
setpart_p_rgs_lex (C++ class)
setpart_rgs_gray (C++ class)
setpart_rgs_lex (C++ class)
shift operator, for Fourier transform
shift operator, for Hartley transform [516]
shift register sequence (SRS) 
shift-and-add algorithms 
shifts in C, pitfall 
shifts, and division
shifts-order 

— for bit combinations, 

— for Fibonacci words, 

— for subsets, [208]

— Gray code, for subsets, [209]
short division [567]
short multiplication
sieve of Eratosthenes [770]
sign decomposition, of a matrix [579] 
sign of a permutation [105] 
sign of the Fourier transform 
signed binary representation 


signed binary words, sparse, Gray code 


simple continued fraction 
simple path in a graph 
simple zero-divisors 


sine and sinh, as hypergeometric function 


sine, CORDIC algorithm [646] 
sine, in a finite field [808] 
single track 

— Gray code, 

— order for permutations, 

— order for subsets, [208] 
singular value decomposition (SVD) [577]
singular values, with elliptic 
skew circular convolution [451]
slant transform [482] 
slant transform, sequency-ordered [483] 
smart, your compiler [26] 
sorting by keys [144] 
sorting, edges in a graph [402] 
sparse counting, and bit subsets [68] 
sparse signed binary representation [61] 


sparse signed binary words, Gray code 
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sparse words, bit counting 
spectrum of a real number [756 

SPI (strong pseudo-irreducible) [839]
split-radix FFT algorithm [425] 
splitting schemes for multiplication 

— for integers, [550]

— for polynomials over GF(2), [827]
splitting, binary, for rational series [651]
SPP (strong pseudoprime) [786]
square modulo p [81] 
square of a permutation [107] 
square root 


— of a matrix, applications, [576]
square-free factorization, with polynomials [863] 
square-free polynomials [858] 
square-free, partitions into such parts [351] 

SRS (shift register sequence) [864]

stable sort 

stack (C++ class) 

stack (LIFO) 

star-transposition order, for permutations [257] 
Stirling numbers 

— of the first kind (cycle numbers), [277]

— of the second kind (set numbers), [358]
strings with fixed content
strong minimal-change order [172]
strong minimal-change order for combinations [183]
strong pseudo-irreducible (SPI) 
strong pseudoprime (SPP) 
structured primes 
subdegree of a polynomial [852] 
subfactorial numbers 
subsequences, recurrence relations for [672] 
subset convolution [493]
subset of bitset, testing 
subset_debruijn (C++ class)
subset_deltalex (C++ class)
subset_gray (C++ class)
subset_gray_delta (C++ class)
subset_lex (C++ class)
subsets 

— of k bits (combinations), 

— of a binary word, [68]

— of a multiset, 
subtraction, modulo m 
sum of digits, with mixed radix numbers [229] 
sum of Gray code digits constant 
sum of two squares 
sum-of-digits constant 
sum-of-digits test, with multiplication [562] 
sumalt algorithm [662] 




sums of divisors, and partitions [352] 
super-linear iteration [587] 
SVD (singular value decomposition) [577]
Swan’s theorem [850]
swapping blocks via quadruple reversal [124] 
swapping two bits [8] 
swapping variables without temporary [6] 
symmetries 

— of the Fourier transform, 

— of the revbin permutation, 
synthetic iterations 


taps, with wavelet ME 
tcrc64 (C++ class) 
tensor product 
terdragon curve 
theta functions: Θ_2, Θ_3, and Θ_4, [604]
Thue constant 
Thue-Morse sequence [44] [461] 
thue_morse (C++ class)
timing, with demo-programs [175] 
TMFA (transposed matrix Fourier algorithm) [438]
toggle between two values
Toom-Cook algorithm [551]
Toom-Cook algorithm for binary polynomials [831]
totient function [776]
towers of Hanoi [735]
trace 

— of a polynomial, 

— of an element in GF(2^n), [887]

— vector, fast computation, [895]

— vector, in finite uS 
trace-orthonormal basis 
transformations, for elliptic K and E 
transformations, of hypergeometric series 
transforms, on binary words H9] 
transition count, for a Gray code [403]
transposed matrix Fourier algorithm (TMFA) 
transposition of a matrix, and zip permutation [127]
transposition of a matrix, in-place 
transpositions of a o LAT 
trigonometric recursion 
trinomial, primitive 
triple reversal technique 
Trotter's algorithm for permutations [254] 
two’s complement, pitfall [4] 
two-close order for k-subsets [215] 
two-close order for combinations [188]
type, of a set partition [359] 
type-1 optimal normal basis 
type-2 optimal normal basis 
type-t Gaussian normal basis 


unitary divisor [353] 
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units (invertible elements) |767 

universal cycle (for combinatorial objects) [874]
unlabeled bracelets [150] 

unranking, with combinatorial objects [172] 
unrolling, of a loop 

unsorted arrays, searching [147] 


unzip permutation 
vertex, of a graph [391]
vertical addition 


Walsh transform [459] 

Walsh transform, multi-dimensional [461] 
wavelet conditions [544]

wavelet filter [544]

wavelet transform [543]

wavelet_filter (C++ class) [544]
weight, of binary polynomial 
weighted arithmetic transform 
weighted convolution 
weighted MFA convolution algorithm 
weighted MFA convolution, mass storage 
weighted OR-convolution [493] 

weighted sum of Gray code digits constant 
weighted sum-of-digits constant [741] 

weighted transform [448] 

Wells’ Gray code for permutations [252] 


Whipple’s identity [690]
Wieferich primes [780]


XOR permutation 
XOR, cyclic 
XOR-convolution [481]
x”, series for [702] 
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yellow code 


Young diagram (with integer partitions) [345]


Z-order 

z-transform [454]

Zeckendorf representation [754] 
zero bytes, finding [55] 

zero divisors, of an algebra [815] 
zero padding, for linear convolution [444] 
zero-divisors of the sedenions [820]
zero-one transitions in a word [12] 
zip permutation 

zip, bit-wise 

z^z, series for [702]


